This action might not be possible to undo. Are you sure you want to continue?
TO
MACHINE LEARNING
AN EARLY DRAFT OF A PROPOSED
TEXTBOOK
Nils J. Nilsson
Robotics Laboratory
Department of Computer Science
Stanford University
Stanford, CA 94305
email: nilsson@cs.stanford.edu
November 3, 1998
Copyright c 2005 Nils J. Nilsson
This material may not be copied, reproduced, or distributed without the
written permission of the copyright holder.
ii
Contents
1 Preliminaries 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 What is Machine Learning? . . . . . . . . . . . . . . . . . 1
1.1.2 Wellsprings of Machine Learning . . . . . . . . . . . . . . 3
1.1.3 Varieties of Machine Learning . . . . . . . . . . . . . . . . 4
1.2 Learning InputOutput Functions . . . . . . . . . . . . . . . . . . 5
1.2.1 Types of Learning . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Input Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Training Regimes . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . 9
1.3 Learning Requires Bias . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Sample Applications . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 13
2 Boolean Functions 15
2.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Boolean Algebra . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Diagrammatic Representations . . . . . . . . . . . . . . . 16
2.2 Classes of Boolean Functions . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Terms and Clauses . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 DNF Functions . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 CNF Functions . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.4 Decision Lists . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5 Symmetric and Voting Functions . . . . . . . . . . . . . . 23
2.2.6 Linearly Separable Functions . . . . . . . . . . . . . . . . 23
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 25
iii
3 Using Version Spaces for Learning 27
3.1 Version Spaces and Mistake Bounds . . . . . . . . . . . . . . . . 27
3.2 Version Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Learning as Search of a Version Space . . . . . . . . . . . . . . . 32
3.4 The Candidate Elimination Method . . . . . . . . . . . . . . . . 32
3.5 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 34
4 Neural Networks 35
4.1 Threshold Logic Units . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Deﬁnitions and Geometry . . . . . . . . . . . . . . . . . . 35
4.1.2 Special Cases of Linearly Separable Functions . . . . . . . 37
4.1.3 ErrorCorrection Training of a TLU . . . . . . . . . . . . 38
4.1.4 Weight Space . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.5 The WidrowHoﬀ Procedure . . . . . . . . . . . . . . . . . 42
4.1.6 Training a TLU on NonLinearlySeparable Training Sets 44
4.2 Linear Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Networks of TLUs . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Motivation and Examples . . . . . . . . . . . . . . . . . . 46
4.3.2 Madalines . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.3 Piecewise Linear Machines . . . . . . . . . . . . . . . . . . 50
4.3.4 Cascade Networks . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Training Feedforward Networks by Backpropagation . . . . . . . 52
4.4.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 The Backpropagation Method . . . . . . . . . . . . . . . . 53
4.4.3 Computing Weight Changes in the Final Layer . . . . . . 56
4.4.4 Computing Changes to the Weights in Intermediate Layers 58
4.4.5 Variations on Backprop . . . . . . . . . . . . . . . . . . . 59
4.4.6 An Application: Steering a Van . . . . . . . . . . . . . . . 60
4.5 Synergies Between Neural Network and KnowledgeBased Methods 61
4.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 61
5 Statistical Learning 63
5.1 Using Statistical Decision Theory . . . . . . . . . . . . . . . . . . 63
5.1.1 Background and General Method . . . . . . . . . . . . . . 63
5.1.2 Gaussian (or Normal) Distributions . . . . . . . . . . . . 65
5.1.3 Conditionally Independent Binary Components . . . . . . 68
5.2 Learning Belief Networks . . . . . . . . . . . . . . . . . . . . . . 70
5.3 NearestNeighbor Methods . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 72
iv
6 Decision Trees 73
6.1 Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Supervised Learning of Univariate Decision Trees . . . . . . . . . 74
6.2.1 Selecting the Type of Test . . . . . . . . . . . . . . . . . . 75
6.2.2 Using Uncertainty Reduction to Select Tests . . . . . . . 75
6.2.3 NonBinary Attributes . . . . . . . . . . . . . . . . . . . . 79
6.3 Networks Equivalent to Decision Trees . . . . . . . . . . . . . . . 79
6.4 Overﬁtting and Evaluation . . . . . . . . . . . . . . . . . . . . . 80
6.4.1 Overﬁtting . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.4.2 Validation Methods . . . . . . . . . . . . . . . . . . . . . 81
6.4.3 Avoiding Overﬁtting in Decision Trees . . . . . . . . . . . 82
6.4.4 MinimumDescription Length Methods . . . . . . . . . . . 83
6.4.5 Noise in Data . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 The Problem of Replicated Subtrees . . . . . . . . . . . . . . . . 84
6.6 The Problem of Missing Attributes . . . . . . . . . . . . . . . . . 86
6.7 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 87
7 Inductive Logic Programming 89
7.1 Notation and Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . 90
7.2 A Generic ILP Algorithm . . . . . . . . . . . . . . . . . . . . . . 91
7.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4 Inducing Recursive Programs . . . . . . . . . . . . . . . . . . . . 98
7.5 Choosing Literals to Add . . . . . . . . . . . . . . . . . . . . . . 100
7.6 Relationships Between ILP and Decision Tree Induction . . . . . 101
7.7 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 104
8 Computational Learning Theory 107
8.1 Notation and Assumptions for PAC Learning Theory . . . . . . . 107
8.2 PAC Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.1 The Fundamental Theorem . . . . . . . . . . . . . . . . . 109
8.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2.3 Some Properly PACLearnable Classes . . . . . . . . . . . 112
8.3 The VapnikChervonenkis Dimension . . . . . . . . . . . . . . . . 113
8.3.1 Linear Dichotomies . . . . . . . . . . . . . . . . . . . . . . 113
8.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.3.3 A More General Capacity Result . . . . . . . . . . . . . . 116
8.3.4 Some Facts and Speculations About the VC Dimension . 117
8.4 VC Dimension and PAC Learning . . . . . . . . . . . . . . . . . 118
8.5 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 118
v
9 Unsupervised Learning 119
9.1 What is Unsupervised Learning? . . . . . . . . . . . . . . . . . . 119
9.2 Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.2.1 A Method Based on Euclidean Distance . . . . . . . . . . 120
9.2.2 A Method Based on Probabilities . . . . . . . . . . . . . . 124
9.3 Hierarchical Clustering Methods . . . . . . . . . . . . . . . . . . 125
9.3.1 A Method Based on Euclidean Distance . . . . . . . . . . 125
9.3.2 A Method Based on Probabilities . . . . . . . . . . . . . . 126
9.4 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 130
10 TemporalDiﬀerence Learning 131
10.1 Temporal Patterns and Prediction Problems . . . . . . . . . . . . 131
10.2 Supervised and TemporalDiﬀerence Methods . . . . . . . . . . . 131
10.3 Incremental Computation of the (∆W)
i
. . . . . . . . . . . . . . 134
10.4 An Experiment with TD Methods . . . . . . . . . . . . . . . . . 135
10.5 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.6 IntraSequence Weight Updating . . . . . . . . . . . . . . . . . . 138
10.7 An Example Application: TDgammon . . . . . . . . . . . . . . . 140
10.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 141
11 DelayedReinforcement Learning 143
11.1 The General Problem . . . . . . . . . . . . . . . . . . . . . . . . 143
11.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.3 Temporal Discounting and Optimal Policies . . . . . . . . . . . . 145
11.4 QLearning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.5 Discussion, Limitations, and Extensions of QLearning . . . . . . 150
11.5.1 An Illustrative Example . . . . . . . . . . . . . . . . . . . 150
11.5.2 Using Random Actions . . . . . . . . . . . . . . . . . . . 152
11.5.3 Generalizing Over Inputs . . . . . . . . . . . . . . . . . . 153
11.5.4 Partially Observable States . . . . . . . . . . . . . . . . . 154
11.5.5 Scaling Problems . . . . . . . . . . . . . . . . . . . . . . . 154
11.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 155
vi
12 ExplanationBased Learning 157
12.1 Deductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 157
12.2 Domain Theories . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.4 Evaluable Predicates . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.5 More General Proofs . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.6 Utility of EBL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.7.1 MacroOperators in Planning . . . . . . . . . . . . . . . . 164
12.7.2 Learning Search Control Knowledge . . . . . . . . . . . . 167
12.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 168
vii
viii
Preface
These notes are in the process of becoming a textbook. The process is quite
unﬁnished, and the author solicits corrections, criticisms, and suggestions from
students and other readers. Although I have tried to eliminate errors, some un
doubtedly remain—caveat lector. Many typographical infelicities will no doubt
persist until the ﬁnal version. More material has yet to be added. Please let Some of my plans for additions and
other reminders are mentioned in
marginal notes.
me have your suggestions about topics that are too important to be left out.
I hope that future versions will cover Hopﬁeld nets, Elman nets and other re
current nets, radial basis functions, grammar and automata learning, genetic
algorithms, and Bayes networks . . .. I am also collecting exercises and project
suggestions which will appear in future versions.
My intention is to pursue a middle ground between a theoretical textbook
and one that focusses on applications. The book concentrates on the important
ideas in machine learning. I do not give proofs of many of the theorems that I
state, but I do give plausibility arguments and citations to formal proofs. And, I
do not treat many matters that would be of practical importance in applications;
the book is not a handbook of machine learning practice. Instead, my goal is
to give the reader suﬃcient preparation to make the extensive literature on
machine learning accessible.
Students in my Stanford courses on machine learning have already made
several useful suggestions, as have my colleague, Pat Langley, and my teaching
assistants, Ron Kohavi, Karl Pﬂeger, Robert Allen, and Lise Getoor.
ix
Chapter 1
Preliminaries
1.1 Introduction
1.1.1 What is Machine Learning?
Learning, like intelligence, covers such a broad range of processes that it is dif
ﬁcult to deﬁne precisely. A dictionary deﬁnition includes phrases such as “to
gain knowledge, or understanding of, or skill in, by study, instruction, or expe
rience,” and “modiﬁcation of a behavioral tendency by experience.” Zoologists
and psychologists study learning in animals and humans. In this book we fo
cus on learning in machines. There are several parallels between animal and
machine learning. Certainly, many techniques in machine learning derive from
the eﬀorts of psychologists to make more precise their theories of animal and
human learning through computational models. It seems likely also that the
concepts and techniques being explored by researchers in machine learning may
illuminate certain aspects of biological learning.
As regards machines, we might say, very broadly, that a machine learns
whenever it changes its structure, program, or data (based on its inputs or in
response to external information) in such a manner that its expected future
performance improves. Some of these changes, such as the addition of a record
to a data base, fall comfortably within the province of other disciplines and are
not necessarily better understood for being called learning. But, for example,
when the performance of a speechrecognition machine improves after hearing
several samples of a person’s speech, we feel quite justiﬁed in that case to say
that the machine has learned.
Machine learning usually refers to the changes in systems that perform tasks
associated with artiﬁcial intelligence (AI). Such tasks involve recognition, diag
nosis, planning, robot control, prediction, etc. The “changes” might be either
enhancements to already performing systems or ab initio synthesis of new sys
tems. To be slightly more speciﬁc, we show the architecture of a typical AI
1
2 CHAPTER 1. PRELIMINARIES
“agent” in Fig. 1.1. This agent perceives and models its environment and com
putes appropriate actions, perhaps by anticipating their eﬀects. Changes made
to any of the components shown in the ﬁgure might count as learning. Diﬀerent
learning mechanisms might be employed depending on which subsystem is being
changed. We will study several diﬀerent learning methods in this book.
Sensory signals
Perception
Actions
Action
Computation
Model
Planning and
Reasoning
Goals
Figure 1.1: An AI System
One might ask “Why should machines have to learn? Why not design ma
chines to perform as desired in the ﬁrst place?” There are several reasons why
machine learning is important. Of course, we have already mentioned that the
achievement of learning in machines might help us understand how animals and
humans learn. But there are important engineering reasons as well. Some of
these are:
• Some tasks cannot be deﬁned well except by example; that is, we might be
able to specify input/output pairs but not a concise relationship between
inputs and desired outputs. We would like machines to be able to adjust
their internal structure to produce correct outputs for a large number of
sample inputs and thus suitably constrain their input/output function to
approximate the relationship implicit in the examples.
• It is possible that hidden among large piles of data are important rela
tionships and correlations. Machine learning methods can often be used
to extract these relationships (data mining).
1.1. INTRODUCTION 3
• Human designers often produce machines that do not work as well as
desired in the environments in which they are used. In fact, certain char
acteristics of the working environment might not be completely known
at design time. Machine learning methods can be used for onthejob
improvement of existing machine designs.
• The amount of knowledge available about certain tasks might be too large
for explicit encoding by humans. Machines that learn this knowledge
gradually might be able to capture more of it than humans would want to
write down.
• Environments change over time. Machines that can adapt to a changing
environment would reduce the need for constant redesign.
• New knowledge about tasks is constantly being discovered by humans.
Vocabulary changes. There is a constant stream of new events in the
world. Continuing redesign of AI systems to conform to new knowledge is
impractical, but machine learning methods might be able to track much
of it.
1.1.2 Wellsprings of Machine Learning
Work in machine learning is now converging from several sources. These dif
ferent traditions each bring diﬀerent methods and diﬀerent vocabulary which
are now being assimilated into a more uniﬁed discipline. Here is a brief listing
of some of the separate disciplines that have contributed to machine learning;
more details will follow in the the appropriate chapters:
• Statistics: A longstanding problem in statistics is how best to use sam
ples drawn from unknown probability distributions to help decide from
which distribution some new sample is drawn. A related problem is how
to estimate the value of an unknown function at a new point given the
values of this function at a set of sample points. Statistical methods
for dealing with these problems can be considered instances of machine
learning because the decision and estimation rules depend on a corpus of
samples drawn from the problem environment. We will explore some of
the statistical methods later in the book. Details about the statistical the
ory underlying these methods can be found in statistical textbooks such
as [Anderson, 1958].
• Brain Models: Nonlinear elements with weighted inputs
have been suggested as simple models of biological neu
rons. Networks of these elements have been studied by sev
eral researchers including [McCulloch & Pitts, 1943, Hebb, 1949,
Rosenblatt, 1958] and, more recently by [Gluck & Rumelhart, 1989,
Sejnowski, Koch, & Churchland, 1988]. Brain modelers are interested
in how closely these networks approximate the learning phenomena of
4 CHAPTER 1. PRELIMINARIES
living brains. We shall see that several important machine learning
techniques are based on networks of nonlinear elements—often called
neural networks. Work inspired by this school is sometimes called
connectionism, brainstyle computation, or subsymbolic processing.
• Adaptive Control Theory: Control theorists study the problem of con
trolling a process having unknown parameters which must be estimated
during operation. Often, the parameters change during operation, and the
control process must track these changes. Some aspects of controlling a
robot based on sensory inputs represent instances of this sort of problem.
For an introduction see [Bollinger & Duﬃe, 1988].
• Psychological Models: Psychologists have studied the performance of
humans in various learning tasks. An early example is the EPAM net
work for storing and retrieving one member of a pair of words when
given another [Feigenbaum, 1961]. Related work led to a number of
early decision tree [Hunt, Marin, & Stone, 1966] and semantic network
[Anderson & Bower, 1973] methods. More recent work of this sort has
been inﬂuenced by activities in artiﬁcial intelligence which we will be pre
senting.
Some of the work in reinforcement learning can be traced to eﬀorts to
model how reward stimuli inﬂuence the learning of goalseeking behavior in
animals [Sutton & Barto, 1987]. Reinforcement learning is an important
theme in machine learning research.
• Artiﬁcial Intelligence: From the beginning, AI research has been con
cerned with machine learning. Samuel developed a prominent early pro
gram that learned parameters of a function for evaluating board posi
tions in the game of checkers [Samuel, 1959]. AI researchers have also
explored the role of analogies in learning [Carbonell, 1983] and how fu
ture actions and decisions can be based on previous exemplary cases
[Kolodner, 1993]. Recent work has been directed at discovering rules
for expert systems using decisiontree methods [Quinlan, 1990] and in
ductive logic programming [Muggleton, 1991, Lavraˇc & Dˇzeroski, 1994].
Another theme has been saving and generalizing the results of prob
lem solving using explanationbased learning [DeJong & Mooney, 1986,
Laird, et al., 1986, Minton, 1988, Etzioni, 1993].
• Evolutionary Models:
In nature, not only do individual animals learn to perform better, but
species evolve to be better ﬁt in their individual niches. Since the distinc
tion between evolving and learning can be blurred in computer systems,
techniques that model certain aspects of biological evolution have been
proposed as learning methods to improve the performance of computer
programs. Genetic algorithms [Holland, 1975] and genetic programming
[Koza, 1992, Koza, 1994] are the most prominent computational tech
niques for evolution.
1.2. LEARNING INPUTOUTPUT FUNCTIONS 5
1.1.3 Varieties of Machine Learning
Orthogonal to the question of the historical source of any learning technique is
the more important question of what is to be learned. In this book, we take it
that the thing to be learned is a computational structure of some sort. We will
consider a variety of diﬀerent computational structures:
• Functions
• Logic programs and rule sets
• Finitestate machines
• Grammars
• Problem solving systems
We will present methods both for the synthesis of these structures from examples
and for changing existing structures. In the latter case, the change to the
existing structure might be simply to make it more computationally eﬃcient
rather than to increase the coverage of the situations it can handle. Much of
the terminology that we shall be using throughout the book is best introduced
by discussing the problem of learning functions, and we turn to that matter
ﬁrst.
1.2 Learning InputOutput Functions
We use Fig. 1.2 to help deﬁne some of the terminology used in describing the
problem of learning a function. Imagine that there is a function, f, and the task
of the learner is to guess what it is. Our hypothesis about the function to be
learned is denoted by h. Both f and h are functions of a vectorvalued input
X = (x
1
, x
2
, . . . , x
i
, . . . , x
n
) which has n components. We think of h as being
implemented by a device that has X as input and h(X) as output. Both f and
h themselves may be vectorvalued. We assume a priori that the hypothesized
function, h, is selected from a class of functions H. Sometimes we know that
f also belongs to this class or to a subset of this class. We select h based on a
training set, Ξ, of m input vector examples. Many important details depend on
the nature of the assumptions made about all of these entities.
1.2.1 Types of Learning
There are two major settings in which we wish to learn a function. In one,
called supervised learning, we know (sometimes only approximately) the values
of f for the m samples in the training set, Ξ. We assume that if we can ﬁnd
a hypothesis, h, that closely agrees with f for the members of Ξ, then this
hypothesis will be a good guess for f—especially if Ξ is large.
6 CHAPTER 1. PRELIMINARIES
h(X)
h
U = {X
1
, X
2
, . . . X
i
, . . ., X
m
}
Training Set:
X =
x
1
.
.
.
x
i
.
.
.
x
n
h D H
Figure 1.2: An InputOutput Function
Curveﬁtting is a simple example of supervised learning of a function. Sup
pose we are given the values of a twodimensional function, f, at the four sample
points shown by the solid circles in Fig. 1.3. We want to ﬁt these four points
with a function, h, drawn from the set, H, of seconddegree functions. We show
there a twodimensional parabolic surface above the x
1
, x
2
plane that ﬁts the
points. This parabolic function, h, is our hypothesis about the function, f, that
produced the four samples. In this case, h = f at the four samples, but we need
not have required exact matches.
In the other setting, termed unsupervised learning, we simply have a train
ing set of vectors without function values for them. The problem in this case,
typically, is to partition the training set into subsets, Ξ
1
, . . . , Ξ
R
, in some ap
propriate way. (We can still regard the problem as one of learning a function;
the value of the function is the name of the subset to which an input vector be
longs.) Unsupervised learning methods have application in taxonomic problems
in which it is desired to invent ways to classify data into meaningful categories.
We shall also describe methods that are intermediate between supervised
and unsupervised learning.
We might either be trying to ﬁnd a new function, h, or to modify an existing
one. An interesting special case is that of changing an existing function into an
equivalent one that is computationally more eﬃcient. This type of learning is
sometimes called speedup learning. A very simple example of speedup learning
involves deduction processes. From the formulas A ⊃ B and B ⊃ C, we can
deduce C if we are given A. From this deductive process, we can create the
formula A ⊃ C—a new formula but one that does not sanction any more con
1.2. LEARNING INPUTOUTPUT FUNCTIONS 7
10
5
0
5
10
10
5
0
5
10
0
500
1000
1500
10
5
0
5
10
10
5
0
5
10
0
00
00
0
x
1
x
2
h
sample fvalue
Figure 1.3: A Surface that Fits Four Points
clusions than those that could be derived from the formulas that we previously
had. But with this new formula we can derive C more quickly, given A, than
we could have done before. We can contrast speedup learning with methods
that create genuinely new functions—ones that might give diﬀerent results after
learning than they did before. We say that the latter methods involve inductive
learning. As opposed to deduction, there are no correct inductions—only useful
ones.
1.2.2 Input Vectors
Because machine learning methods derive from so many diﬀerent traditions, its
terminology is rife with synonyms, and we will be using most of them in this
book. For example, the input vector is called by a variety of names. Some
of these are: input vector, pattern vector, feature vector, sample, example, and
instance. The components, x
i
, of the input vector are variously called features,
attributes, input variables, and components.
The values of the components can be of three main types. They might
be realvalued numbers, discretevalued numbers, or categorical values. As an
example illustrating categorical values, information about a student might be
represented by the values of the attributes class, major, sex, adviser. A par
ticular student would then be represented by a vector such as: (sophomore,
history, male, higgins). Additionally, categorical values may be ordered (as in
{small, medium, large}) or unordered (as in the example just given). Of course,
mixtures of all these types of values are possible.
In all cases, it is possible to represent the input in unordered form by listing
the names of the attributes together with their values. The vector form assumes
that the attributes are ordered and given implicitly by a form. As an example
of an attributevalue representation, we might have: (major: history, sex: male,
8 CHAPTER 1. PRELIMINARIES
class: sophomore, adviser: higgins, age: 19). We will be using the vector form
exclusively.
An important specialization uses Boolean values, which can be regarded as
a special case of either discrete numbers (1,0) or of categorical variables (True,
False).
1.2.3 Outputs
The output may be a real number, in which case the process embodying the
function, h, is called a function estimator, and the output is called an output
value or estimate.
Alternatively, the output may be a categorical value, in which case the pro
cess embodying h is variously called a classiﬁer, a recognizer, or a categorizer,
and the output itself is called a label, a class, a category, or a decision. Classi
ﬁers have application in a number of recognition problems, for example in the
recognition of handprinted characters. The input in that case is some suitable
representation of the printed character, and the classiﬁer maps this input into
one of, say, 64 categories.
Vectorvalued outputs are also possible with components being real numbers
or categorical values.
An important special case is that of Boolean output values. In that case,
a training pattern having value 1 is called a positive instance, and a training
sample having value 0 is called a negative instance. When the input is also
Boolean, the classiﬁer implements a Boolean function. We study the Boolean
case in some detail because it allows us to make important general points in
a simpliﬁed setting. Learning a Boolean function is sometimes called concept
learning, and the function is called a concept.
1.2.4 Training Regimes
There are several ways in which the training set, Ξ, can be used to produce a
hypothesized function. In the batch method, the entire training set is available
and used all at once to compute the function, h. A variation of this method
uses the entire training set to modify a current hypothesis iteratively until an
acceptable hypothesis is obtained. By contrast, in the incremental method, we
select one member at a time from the training set and use this instance alone
to modify a current hypothesis. Then another member of the training set is
selected, and so on. The selection method can be random (with replacement)
or it can cycle through the training set iteratively. If the entire training set
becomes available one member at a time, then we might also use an incremental
method—selecting and using training set members as they arrive. (Alterna
tively, at any stage all training set members so far available could be used in a
“batch” process.) Using the training set members as they become available is
called an online method. Online methods might be used, for example, when the
1.3. LEARNING REQUIRES BIAS 9
next training instance is some function of the current hypothesis and the previ
ous instance—as it would be when a classiﬁer is used to decide on a robot’s next
action given its current set of sensory inputs. The next set of sensory inputs
will depend on which action was selected.
1.2.5 Noise
Sometimes the vectors in the training set are corrupted by noise. There are two
kinds of noise. Class noise randomly alters the value of the function; attribute
noise randomly alters the values of the components of the input vector. In either
case, it would be inappropriate to insist that the hypothesized function agree
precisely with the values of the samples in the training set.
1.2.6 Performance Evaluation
Even though there is no correct answer in inductive learning, it is important
to have methods to evaluate the result of learning. We will discuss this matter
in more detail later, but, brieﬂy, in supervised learning the induced function is
usually evaluated on a separate set of inputs and function values for them called
the testing set . A hypothesized function is said to generalize when it guesses
well on the testing set. Both meansquarederror and the total number of errors
are common measures.
1.3 Learning Requires Bias
Long before now the reader has undoubtedly asked why is learning a function
possible at all? Certainly, for example, there are an uncountable number of
diﬀerent functions having values that agree with the four samples shown in Fig.
1.3. Why would a learning procedure happen to select the quadratic one shown
in that ﬁgure? In order to make that selection we had at least to limit a priori
the set of hypotheses to quadratic functions and then to insist that the one we
chose passed through all four sample points. This kind of a priori information
is called bias, and useful learning without bias is impossible.
We can gain more insight into the role of bias by considering the special case
of learning a Boolean function of n dimensions. There are 2
n
diﬀerent Boolean
inputs possible. Suppose we had no bias; that is H is the set of all 2
2
n
Boolean
functions, and we have no preference among those that ﬁt the samples in the
training set. In this case, after being presented with one member of the training
set and its value we can rule out precisely onehalf of the members of H—those
Boolean functions that would misclassify this labeled sample. The remaining
functions constitute what is called a “version space;” we’ll explore that concept
in more detail later. As we present more members of the training set, the graph
of the number of hypotheses not yet ruled out as a function of the number of
diﬀerent patterns presented is as shown in Fig. 1.4. At any stage of the process,
10 CHAPTER 1. PRELIMINARIES
half of the remaining Boolean functions have value 1 and half have value 0 for
any training pattern not yet seen. No generalization is possible in this case
because the training patterns give no clue about the value of a pattern not yet
seen. Only memorization is possible here, which is a trivial sort of learning.
log
2
H
v

2
n
2
n
j = no. of labeled
patterns already seen
0
0
2
n
< j
(generalization is not possible)
H
v
 = no. of functions not ruled out
Figure 1.4: Hypotheses Remaining as a Function of Labeled Patterns Presented
But suppose we limited H to some subset, H
c
, of all Boolean functions.
Depending on the subset and on the order of presentation of training patterns,
a curve of hypotheses not yet ruled out might look something like the one
shown in Fig. 1.5. In this case it is even possible that after seeing fewer than
all 2
n
labeled samples, there might be only one hypothesis that agrees with
the training set. Certainly, even if there is more than one hypothesis remaining,
most of them may have the same value for most of the patterns not yet seen! The
theory of Probably Approximately Correct (PAC) learning makes this intuitive
idea precise. We’ll examine that theory later.
Let’s look at a speciﬁc example of how bias aids learning. A Boolean function
can be represented by a hypercube each of whose vertices represents a diﬀerent
input pattern. We show a 3dimensional version in Fig. 1.6. There, we show a
training set of six sample patterns and have marked those having a value of 1 by
a small square and those having a value of 0 by a small circle. If the hypothesis
set consists of just the linearly separable functions—those for which the positive
and negative instances can be separated by a linear surface, then there is only
one function remaining in this hypothsis set that is consistent with the training
set. So, in this case, even though the training set does not contain all possible
patterns, we can already pin down what the function must be—given the bias.
1.4. SAMPLE APPLICATIONS 11
log
2
H
v

2
n
2
n
j = no. of labeled
patterns already seen
0
0
H
v
 = no. of functions not ruled out
depends on order
of presentation
log
2
H
c

Figure 1.5: Hypotheses Remaining From a Restricted Subset
Machine learning researchers have identiﬁed two main varieties of bias, ab
solute and preference. In absolute bias (also called restricted hypothesisspace
bias), one restricts Hto a deﬁnite subset of functions. In our example of Fig. 1.6,
the restriction was to linearly separable Boolean functions. In preference bias,
one selects that hypothesis that is minimal according to some ordering scheme
over all hypotheses. For example, if we had some way of measuring the complex
ity of a hypothesis, we might select the one that was simplest among those that
performed satisfactorily on the training set. The principle of Occam’s razor,
used in science to prefer simple explanations to more complex ones, is a type
of preference bias. (William of Occam, 1285?1349, was an English philosopher
who said: “non sunt multiplicanda entia praeter necessitatem,” which means
“entities should not be multiplied unnecessarily.”)
1.4 Sample Applications
Our main emphasis in this book is on the concepts of machine learning—not
on its applications. Nevertheless, if these concepts were irrelevant to realworld
problems they would probably not be of much interest. As motivation, we give
a short summary of some areas in which machine learning techniques have been
successfully applied. [Langley, 1992] cites some of the following applications and
others:
a. Rule discovery using a variant of ID3 for a printing industry problem
12 CHAPTER 1. PRELIMINARIES
x
1
x
2
x
3
Figure 1.6: A Training Set That Completely Determines a Linearly Separable
Function
[Evans & Fisher, 1992].
b. Electric power load forecasting using a knearestneighbor rule system
[Jabbour, K., et al., 1987].
c. Automatic “help desk” assistant using a nearestneighbor system
[Acorn & Walden, 1992].
d. Planning and scheduling for a steel mill using ExpertEase, a marketed
(ID3like) system [Michie, 1992].
e. Classiﬁcation of stars and galaxies [Fayyad, et al., 1993].
Many applicationoriented papers are presented at the annual conferences
on Neural Information Processing Systems. Among these are papers on: speech
recognition, dolphin echo recognition, image processing, bioengineering, diag
nosis, commodity trading, face recognition, music composition, optical character
recognition, and various control applications [Various Editors, 19891994].
As additional examples, [Hammerstrom, 1993] mentions:
a. Sharp’s Japanese kanji character recognition system processes 200 char
acters per second with 99+% accuracy. It recognizes 3000+ characters.
b. NeuroForecasting Centre’s (London Business School and University Col
lege London) trading strategy selection network earned an average annual
proﬁt of 18% against a conventional system’s 12.3%.
1.5. SOURCES 13
c. Fujitsu’s (plus a partner’s) neural network for monitoring a continuous
steel casting operation has been in successful operation since early 1990.
In summary, it is rather easy nowadays to ﬁnd applications of machine learn
ing techniques. This fact should come as no surprise inasmuch as many machine
learning techniques can be viewed as extensions of well known statistical meth
ods which have been successfully applied for many years.
1.5 Sources
Besides the rich literature in machine learning (a small part of
which is referenced in the Bibliography), there are several text
books that are worth mentioning [Hertz, Krogh, & Palmer, 1991,
Weiss & Kulikowski, 1991, Natarjan, 1991, Fu, 1994, Langley, 1996].
[Shavlik & Dietterich, 1990, Buchanan & Wilkins, 1993] are edited vol
umes containing some of the most important papers. A survey paper by
[Dietterich, 1990] gives a good overview of many important topics. There are
also well established conferences and publications where papers are given and
appear including:
• The Annual Conferences on Advances in Neural Information Processing
Systems
• The Annual Workshops on Computational Learning Theory
• The Annual International Workshops on Machine Learning
• The Annual International Conferences on Genetic Algorithms
(The Proceedings of the abovelisted four conferences are published by
Morgan Kaufmann.)
• The journal Machine Learning (published by Kluwer Academic Publish
ers).
There is also much information, as well as programs and datasets, available over
the Internet through the World Wide Web.
1.6 Bibliographical and Historical Remarks
To be added. Every chapter will
contain a brief survey of the history
of the material covered in that
chapter.
14 CHAPTER 1. PRELIMINARIES
Chapter 2
Boolean Functions
2.1 Representation
2.1.1 Boolean Algebra
Many important ideas about learning of functions are most easily presented
using the special case of Boolean functions. There are several important sub
classes of Boolean functions that are used as hypothesis classes for function
learning. Therefore, we digress in this chapter to present a review of Boolean
functions and their properties. (For a more thorough treatment see, for example,
[Unger, 1989].)
A Boolean function, f(x
1
, x
2
, . . . , x
n
) maps an ntuple of (0,1) values to
{0, 1}. Boolean algebra is a convenient notation for representing Boolean func
tions. Boolean algebra uses the connectives ·, +, and . For example, the and
function of two variables is written x
1
· x
2
. By convention, the connective, “·”
is usually suppressed, and the and function is written x
1
x
2
. x
1
x
2
has value 1 if
and only if both x
1
and x
2
have value 1; if either x
1
or x
2
has value 0, x
1
x
2
has
value 0. The (inclusive) or function of two variables is written x
1
+x
2
. x
1
+x
2
has value 1 if and only if either or both of x
1
or x
2
has value 1; if both x
1
and
x
2
have value 0, x
1
+x
2
has value 0. The complement or negation of a variable,
x, is written x. x has value 1 if and only if x has value 0; if x has value 1, x has
value 0.
These deﬁnitions are compactly given by the following rules for Boolean
algebra:
1 + 1 = 1, 1 + 0 = 1, 0 + 0 = 0,
1 · 1 = 1, 1 · 0 = 0, 0 · 0 = 0, and
1 = 0, 0 = 1.
Sometimes the arguments and values of Boolean functions are expressed in
terms of the constants T (True) and F (False) instead of 1 and 0, respectively.
15
16 CHAPTER 2. BOOLEAN FUNCTIONS
The connectives · and + are each commutative and associative. Thus, for
example, x
1
(x
2
x
3
) = (x
1
x
2
)x
3
, and both can be written simply as x
1
x
2
x
3
.
Similarly for +.
A Boolean formula consisting of a single variable, such as x
1
is called an
atom. One consisting of either a single variable or its complement, such as x
1
,
is called a literal.
The operators · and + do not commute between themselves. Instead, we
have DeMorgan’s laws (which can be veriﬁed by using the above deﬁnitions):
x
1
x
2
= x
1
+x
2
, and
x
1
+x
2
= x
1
x
2
.
2.1.2 Diagrammatic Representations
We saw in the last chapter that a Boolean function could be represented by
labeling the vertices of a cube. For a function of n variables, we would need
an ndimensional hypercube. In Fig. 2.1 we show some 2 and 3dimensional
examples. Vertices having value 1 are labeled with a small square, and vertices
having value 0 are labeled with a small circle.
x
1
x
2
x
1
x
2
x
1
x
2
and
or
xor (exclusive or)
x
1
x
2
x
1
+ x
2
x
1
x
2
+ x
1
x
2
even parity function
x
1
x
2
x
3
x
1
x
2
x
3
+ x
1
x
2
x
3
+
x
1
x
2
x
3
+
x
1
x
2
x
3
Figure 2.1: Representing Boolean Functions on Cubes
Using the hypercube representations, it is easy to see how many Boolean
functions of n dimensions there are. A 3dimensional cube has 2
3
= 8 vertices,
and each may be labeled in two diﬀerent ways; thus there are 2
(2
3
)
= 256
2.2. CLASSES OF BOOLEAN FUNCTIONS 17
diﬀerent Boolean functions of 3 variables. In general, there are 2
2
n
Boolean
functions of n variables.
We will be using 2 and 3dimensional cubes later to provide some intuition
about the properties of certain Boolean functions. Of course, we cannot visualize
hypercubes (for n > 3), and there are many surprising properties of higher
dimensional spaces, so we must be careful in using intuitions gained in low
dimensions. One diagrammatic technique for dimensions slightly higher than
3 is the Karnaugh map. A Karnaugh map is an array of values of a Boolean
function in which the horizontal rows are indexed by the values of some of
the variables and the vertical columns are indexed by the rest. The rows and
columns are arranged in such a way that entries that are adjacent in the map
correspond to vertices that are adjacent in the hypercube representation. We
show an example of the 4dimensional even parity function in Fig. 2.2. (An
even parity function is a Boolean function that has value 1 if there are an even
number of its arguments that have value 1; otherwise it has value 0.) Note
that all adjacent cells in the table correspond to inputs diﬀering in only one
component. Also describe general logic
diagrams, [Wnek, et al., 1990].
00 01 10 11
00
01
10
11
1 1
1
1
1
1
1
1 0
0 0
0
0
0
0
0
x
1
,x
2
x
3
,x
4
Figure 2.2: A Karnaugh Map
2.2 Classes of Boolean Functions
2.2.1 Terms and Clauses
To use absolute bias in machine learning, we limit the class of hypotheses. In
learning Boolean functions, we frequently use some of the common subclasses of
those functions. Therefore, it will be important to know about these subclasses.
One basic subclass is called terms. A term is any function written in the
form l
1
l
2
· · · l
k
, where the l
i
are literals. Such a form is called a conjunction of
literals. Some example terms are x
1
x
7
and x
1
x
2
x
4
. The size of a term is the
number of literals it contains. The examples are of sizes 2 and 3, respectively.
(Strictly speaking, the class of conjunctions of literals is called the monomials,
18 CHAPTER 2. BOOLEAN FUNCTIONS
and a conjunction of literals itself is called a term. This distinction is a ﬁne one
which we elect to blur here.)
It is easy to show that there are exactly 3
n
possible terms of n variables.
The number of terms of size k or less is bounded from above by
¸
k
i=0
C(2n, i) =
O(n
k
), where C(i, j) =
i!
(i−j)!j!
is the binomial coeﬃcient. Probably I’ll put in a simple
termlearning algorithm here—so
we can get started on learning!
Also for DNF functions and
decision lists—as they are deﬁned
in the next few pages.
A clause is any function written in the form l
1
+l
2
+· · · +l
k
, where the l
i
are
literals. Such a form is called a disjunction of literals. Some example clauses
are x
3
+ x
5
+ x
6
and x
1
+ x
4
. The size of a clause is the number of literals it
contains. There are 3
n
possible clauses and fewer than
¸
k
i=0
C(2n, i) clauses of
size k or less. If f is a term, then (by De Morgan’s laws) f is a clause, and vice
versa. Thus, terms and clauses are duals of each other.
In psychological experiments, conjunctions of literals seem easier for humans
to learn than disjunctions of literals.
2.2.2 DNF Functions
A Boolean function is said to be in disjunctive normal form (DNF) if it can be
written as a disjunction of terms. Some examples in DNF are: f = x
1
x
2
+x
2
x
3
x
4
and f = x
1
x
3
+ x
2
x
3
+ x
1
x
2
x
3
. A DNF expression is called a kterm DNF
expression if it is a disjunction of k terms; it is in the class kDNF if the size of
its largest term is k. The examples above are 2term and 3term expressions,
respectively. Both expressions are in the class 3DNF.
Each term in a DNF expression for a function is called an implicant because
it “implies” the function (if the term has value 1, so does the function). In
general, a term, t, is an implicant of a function, f, if f has value 1 whenever
t does. A term, t, is a prime implicant of f if the term, t
, formed by taking
any literal out of an implicant t is no longer an implicant of f. (The implicant
cannot be “divided” by any term and remain an implicant.)
Thus, both x
2
x
3
and x
1
x
3
are prime implicants of f = x
2
x
3
+x
1
x
3
+x
2
x
1
x
3
,
but x
2
x
1
x
3
is not.
The relationship between implicants and prime implicants can be geometri
cally illustrated using the cube representation for Boolean functions. Consider,
for example, the function f = x
2
x
3
+ x
1
x
3
+ x
2
x
1
x
3
. We illustrate it in Fig.
2.3. Note that each of the three planes in the ﬁgure “cuts oﬀ” a group of
vertices having value 1, but none cuts oﬀ any vertices having value 0. These
planes are pictorial devices used to isolate certain lower dimensional subfaces
of the cube. Two of them isolate onedimensional edges, and the third isolates
a zerodimensional vertex. Each group of vertices on a subface corresponds to
one of the implicants of the function, f, and thus each implicant corresponds
to a subface of some dimension. A kdimensional subface corresponds to an
(n − k)size implicant term. The function is written as the disjunction of the
implicants—corresponding to the union of all the vertices cut oﬀ by all of the
planes. Geometrically, an implicant is prime if and only if its corresponding
subface is the largest dimensional subface that includes all of its vertices and
2.2. CLASSES OF BOOLEAN FUNCTIONS 19
no other vertices having value 0. Note that the term x
2
x
1
x
3
is not a prime
implicant of f. (In this case, we don’t even have to include this term in the
function because the vertex cut oﬀ by the plane corresponding to x
2
x
1
x
3
is
already cut oﬀ by the plane corresponding to x
2
x
3
.) The other two implicants
are prime because their corresponding subfaces cannot be expanded without
including vertices having value 0.
x
2
x
1
x
3
1, 0, 0
1, 0, 1
1, 1, 1
0, 0, 1
f = x
2
x
3
+ x
1
x
3
+ x
2
x
1
x
3
= x
2
x
3
+ x
1
x
3
x
2
x
3
and x
1
x
3
are prime implicants
Figure 2.3: A Function and its Implicants
Note that all Boolean functions can be represented in DNF—trivially by
disjunctions of terms of size n where each term corresponds to one of the vertices
whose value is 1. Whereas there are 2
2
n
functions of n dimensions in DNF (since
any Boolean function can be written in DNF), there are just 2
O(n
k
)
functions
in kDNF.
All Boolean functions can also be represented in DNF in which each term is
a prime implicant, but that representation is not unique, as shown in Fig. 2.4.
If we can express a function in DNF form, we can use the consensus method
to ﬁnd an expression for the function in which each term is a prime implicant.
The consensus method relies on two results: We may replace this section with
one describing the
QuineMcCluskey method instead.
• Consensus:
20 CHAPTER 2. BOOLEAN FUNCTIONS
x
2
x
1
x
3
1, 0, 0
1, 0, 1
1, 1, 1
0, 0, 1
f = x
2
x
3
+ x
1
x
3
+ x
1
x
2
= x
1
x
2
+ x
1
x
3
All of the terms are prime implicants, but there
is not a unique representation
Figure 2.4: NonUniqueness of Representation by Prime Implicants
x
i
· f
1
+x
i
· f
2
= x
i
· f
1
+x
i
· f
2
+f
1
· f
2
where f
1
and f
2
are terms such that no literal appearing in f
1
appears
complemented in f
2
. f
1
· f
2
is called the consensus of x
i
· f
1
and x
i
·
f
2
. Readers familiar with the resolution rule of inference will note that
consensus is the dual of resolution.
Examples: x
1
is the consensus of x
1
x
2
and x
1
x
2
. The terms x
1
x
2
and x
1
x
2
have no consensus since each term has more than one literal appearing
complemented in the other.
• Subsumption:
x
i
· f
1
+f
1
= f
1
where f
1
is a term. We say that f
1
subsumes x
i
· f
1
.
Example: x
1
x
4
x
5
subsumes x
1
x
4
x
2
x
5
2.2. CLASSES OF BOOLEAN FUNCTIONS 21
The consensus method for ﬁnding a set of prime implicants for a function,
f, iterates the following operations on the terms of a DNF expression for f until
no more such operations can be applied:
a. initialize the process with the set, T , of terms in the DNF expression of
f,
b. compute the consensus of a pair of terms in T and add the result to T ,
c. eliminate any terms in T that are subsumed by other terms in T .
When this process halts, the terms remaining in T are all prime implicants of
f.
Example: Let f = x
1
x
2
+x
1
x
2
x
3
+x
1
x
2
x
3
x
4
x
5
. We show a derivation of
a set of prime implicants in the consensus tree of Fig. 2.5. The circled numbers
adjoining the terms indicate the order in which the consensus and subsumption
operations were performed. Shaded boxes surrounding a term indicate that it
was subsumed. The ﬁnal form of the function in which all terms are prime
implicants is: f = x
1
x
2
+x
1
x
3
+x
1
x
4
x
5
. Its terms are all of the nonsubsumed
terms in the consensus tree.
x
1
x
2
x
1
x
2
x
3
x
1
x
2
x
3
x
4
x
5
x
1
x
3
x
1
x
2
x
4
x
5
x
1
x
4
x
5
f = x
1
x
2
+
+ x
1
x
3
x
1
x
4
x
5
1
2
6
4
5
3
Figure 2.5: A Consensus Tree
2.2.3 CNF Functions
Disjunctive normal form has a dual: conjunctive normal form (CNF). A Boolean
function is said to be in CNF if it can be written as a conjunction of clauses.
22 CHAPTER 2. BOOLEAN FUNCTIONS
An example in CNF is: f = (x
1
+x
2
)(x
2
+x
3
+x
4
). A CNF expression is called
a kclause CNF expression if it is a conjunction of k clauses; it is in the class
kCNF if the size of its largest clause is k. The example is a 2clause expression
in 3CNF. If f is written in DNF, an application of De Morgan’s law renders f
in CNF, and vice versa. Because CNF and DNF are duals, there are also 2
O(n
k
)
functions in kCNF.
2.2.4 Decision Lists
Rivest has proposed a class of Boolean functions called decision lists [Rivest, 1987].
A decision list is written as an ordered list of pairs:
(t
q
, v
q
)
(t
q−1
, v
q−1
)
· · ·
(t
i
, v
i
)
· · ·
(t
2
, v
2
)
(T, v
1
)
where the v
i
are either 0 or 1, the t
i
are terms in (x
1
, . . . , x
n
), and T is a term
whose value is 1 (regardless of the values of the x
i
). The value of a decision list
is the value of v
i
for the ﬁrst t
i
in the list that has value 1. (At least one t
i
will
have value 1, because the last one does; v
1
can be regarded as a default value of
the decision list.) The decision list is of size k, if the size of the largest term in
it is k. The class of decision lists of size k or less is called kDL.
An example decision list is:
f =
(x
1
x
2
, 1)
(x
1
x
2
x
3
, 0)
x
2
x
3
, 1)
(1, 0)
f has value 0 for x
1
= 0, x
2
= 0, and x
3
= 1. It has value 1 for x
1
= 1, x
2
= 0,
and x
3
= 1. This function is in 3DL.
It has been shown that the class kDL is a strict superset of the union of
kDNF and kCNF. There are 2
O[n
k
k log(n)]
functions in kDL [Rivest, 1987].
Interesting generalizations of decision lists use other Boolean functions in
place of the terms, t
i
. For example we might use linearly separable functions in
place of the t
i
(see below and [Marchand & Golea, 1993]).
2.2. CLASSES OF BOOLEAN FUNCTIONS 23
2.2.5 Symmetric and Voting Functions
A Boolean function is called symmetric if it is invariant under permutations
of the input variables. For example, any function that is dependent only on
the number of input variables whose values are 1 is a symmetric function. The
parity functions, which have value 1 depending on whether or not the number
of input variables with value 1 is even or odd is a symmetric function. (The
exclusive or function, illustrated in Fig. 2.1, is an oddparity function of two
dimensions. The or and and functions of two dimensions are also symmetric.)
An important subclass of the symmetric functions is the class of voting func
tions (also called mofn functions). A kvoting function has value 1 if and only
if k or more of its n inputs has value 1. If k = 1, a voting function is the same
as an nsized clause; if k = n, a voting function is the same as an nsized term;
if k = (n + 1)/2 for n odd or k = 1 + n/2 for n even, we have the majority
function.
2.2.6 Linearly Separable Functions
The linearly separable functions are those that can be expressed as follows:
f = thresh(
n
¸
i=1
w
i
x
i
, θ)
where w
i
, i = 1, . . . , n, are realvalued numbers called weights, θ is a realvalued
number called the threshold, and thresh(σ, θ) is 1 if σ ≥ θ and 0 otherwise.
(Note that the concept of linearly separable functions can be extended to non
Boolean inputs.) The kvoting functions are all members of the class of linearly
separable functions in which the weights all have unit value and the threshold
depends on k. Thus, terms and clauses are special cases of linearly separable
functions.
A convenient way to write linearly separable functions uses vector notation:
f = thresh(X· W, θ)
where X = (x
1
, . . . , x
n
) is an ndimensional vector of input variables, W =
(w
1
, . . . , w
n
) is an ndimensional vector of weight values, and X· W is the dot
(or inner) product of the two vectors. Input vectors for which f has value 1 lie
in a halfspace on one side of (and on) a hyperplane whose orientation is normal
to W and whose position (with respect to the origin) is determined by θ. We
saw an example of such a separating plane in Fig. 1.6. With this idea in mind,
it is easy to see that two of the functions in Fig. 2.1 are linearly separable, while
two are not. Also note that the terms in Figs. 2.3 and 2.4 are linearly separable
functions as evidenced by the separating planes shown.
There is no closedform expression for the number of linearly separable func
tions of n dimensions, but the following table gives the numbers for n up to 6.
24 CHAPTER 2. BOOLEAN FUNCTIONS
n Boolean Linearly Separable
Functions Functions
1 4 4
2 16 14
3 256 104
4 65,536 1,882
5 ≈ 4.3 ×10
9
94,572
6 ≈ 1.8 ×10
19
15,028,134
[Muroga, 1971] has shown that (for n > 1) there are no more than 2
n
2
linearly
separable functions of n dimensions. (See also [Winder, 1961, Winder, 1962].)
2.3 Summary
The diagram in Fig. 2.6 shows some of the set inclusions of the classes of Boolean
functions that we have considered. We will be confronting these classes again
in later chapters.
DNF
(All)
kDL
kDNF
ksize
terms
terms
lin sep
Figure 2.6: Classes of Boolean Functions
The sizes of the various classes are given in the following table (adapted from
[Dietterich, 1990, page 262]):
2.4. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 25
Class Size of Class
terms 3
n
clauses 3
n
kterm DNF 2
O(kn)
kclause CNF 2
O(kn)
kDNF 2
O(n
k
)
kCNF 2
O(n
k
)
kDL 2
O[n
k
k log(n)]
lin sep 2
O(n
2
)
DNF 2
2
n
2.4 Bibliographical and Historical Remarks
To be added.
26 CHAPTER 2. BOOLEAN FUNCTIONS
Chapter 3
Using Version Spaces for
Learning
3.1 Version Spaces and Mistake Bounds
The ﬁrst learning methods we present are based on the concepts of version
spaces and version graphs. These ideas are most clearly explained for the case
of Boolean function learning. Given an initial hypothesis set H (a subset of
all Boolean functions) and the values of f(X) for each X in a training set, Ξ,
the version space is that subset of hypotheses, H
v
, that is consistent with these
values. A hypothesis, h, is consistent with the values of X in Ξ if and only if
h(X) = f(X) for all X in Ξ. We say that the hypotheses in H that are not
consistent with the values in the training set are ruled out by the training set.
We could imagine (conceptually only!) that we have devices for implement
ing every function in H. An incremental training procedure could then be
deﬁned which presented each pattern in Ξ to each of these functions and then
eliminated those functions whose values for that pattern did not agree with its
given value. At any stage of the process we would then have left some subset
of functions that are consistent with the patterns presented so far; this subset
is the version space for the patterns already presented. This idea is illustrated
in Fig. 3.1.
Consider the following procedure for classifying an arbitrary input pattern,
X: the pattern is put in the same class (0 or 1) as are the majority of the
outputs of the functions in the version space. During the learning procedure,
if this majority is not equal to the value of the pattern presented, we say a
mistake is made, and we revise the version space accordingly—eliminating all
those (majority of the) functions voting incorrectly. Thus, whenever a mistake
is made, we rule out at least half of the functions remaining in the version space.
How many mistakes can such a procedure make? Obviously, we can make
no more than log
2
(H) mistakes, where H is the number of hypotheses in the
27
28 CHAPTER 3. USING VERSION SPACES FOR LEARNING
h
1
h
2
h
i
h
K
X
A Subset, H, of all
Boolean Functions
Rule out hypotheses not
consistent with training patterns
h
j
Hypotheses not ruled out
constitute the version space
K = H
1 or 0
Figure 3.1: Implementing the Version Space
original hypothesis set, H. (Note, though, that the number of training patterns
seen before this maximum number of mistakes is made might be much greater.)
This theoretical (and very impractical!) result (due to [Littlestone, 1988]) is an
example of a mistake bound—an important concept in machine learning theory.
It shows that there must exist a learning procedure that makes no more mistakes
than this upper bound. Later, we’ll derive other mistake bounds.
As a special case, if our bias was to limit H to terms, we would make no
more than log
2
(3
n
) = nlog
2
(3) = 1.585n mistakes before exhausting the version
space. This result means that if f were a term, we would make no more than
1.585n mistakes before learning f, and otherwise we would make no more than
that number of mistakes before being able to decide that f is not a term.
Even if we do not have suﬃcient training patterns to reduce the version
space to a single function, it may be that there are enough training patterns
to reduce the version space to a set of functions such that most of them assign
the same values to most of the patterns we will see henceforth. We could select
one of the remaining functions at random and be reasonably assured that it
will generalize satisfactorily. We next discuss a computationally more feasible
method for representing the version space.
3.2. VERSION GRAPHS 29
3.2 Version Graphs
Boolean functions can be ordered by generality. A Boolean function, f
1
, is more
general than a function, f
2
, (and f
2
is more speciﬁc than f
1
), if f
1
has value 1
for all of the arguments for which f
2
has value 1, and f
1
= f
2
. For example, x
3
is more general than x
2
x
3
but is not more general than x
3
+x
2
.
We can form a graph with the hypotheses, {h
i
}, in the version space as
nodes. A node in the graph, h
i
, has an arc directed to node, h
j
, if and only if
h
j
is more general than h
i
. We call such a graph a version graph. In Fig. 3.2,
we show an example of a version graph over a 3dimensional input space for
hypotheses restricted to terms (with none of them yet ruled out).
0
x
1
x
2 x
3 x
2
x
3
1
x
1
x
2
x
3
x
1
x
2
x
1
Version Graph for Terms
x
1
x
2
x
3
(for simplicity, only some arcs in the graph are shown)
(none yet ruled out)
(k = 1)
(k = 2)
(k = 3)
x
1
x
3
Figure 3.2: A Version Graph for Terms
That function, denoted here by “1,” which has value 1 for all inputs, corre
sponds to the node at the top of the graph. (It is more general than any other
term.) Similarly, the function “0” is at the bottom of the graph. Just below
“1” is a row of nodes corresponding to all terms having just one literal, and just
below them is a row of nodes corresponding to terms having two literals, and
30 CHAPTER 3. USING VERSION SPACES FOR LEARNING
so on. There are 3
3
= 27 functions altogether (the function “0,” included in
the graph, is technically not a term). To make our portrayal of the graph less
cluttered only some of the arcs are shown; each node in the actual graph has an
arc directed to all of the nodes above it that are more general.
We use this same example to show how the version graph changes as we
consider a set of labeled samples in a training set, Ξ. Suppose we ﬁrst consider
the training pattern (1, 0, 1) with value 0. Some of the functions in the version
graph of Fig. 3.2 are inconsistent with this training pattern. These ruled out
nodes are no longer in the version graph and are shown shaded in Fig. 3.3. We
also show there the threedimensional cube representation in which the vertex
(1, 0, 1) has value 0.
0
x
1
x
2 x
3
x
2
x
3
1
x
1
x
2
x
3
x
1
x
2
x
1
New Version Graph
1, 0, 1 has
value 0
x
1
x
3
x
1
x
2
x
2
x
3
x
1
x
2
x
3
x
1
x
2
x
3
x
1
x
3
(only some arcs in the graph are shown)
ruled out nodes
Figure 3.3: The Version Graph Upon Seeing (1, 0, 1)
In a version graph, there are always a set of hypotheses that are maximally
general and a set of hypotheses that are maximally speciﬁc. These are called
the general boundary set (gbs) and the speciﬁc boundary set (sbs), respectively.
In Fig. 3.4, we have the version graph as it exists after learning that (1,0,1) has
value 0 and (1, 0, 0) has value 1. The gbs and sbs are shown.
3.2. VERSION GRAPHS 31
0
x
1
x
2
x
3
x
2
x
3
1
x
1
x
2
x
3
x
1
x
2
x
3 x
1
x
3
general boundary set
(gbs)
specific boundary set (sbs)
x
1
x
2
more specific than gbs,
more general than sbs
1, 0, 1 has
value 0
x
1
x
2
x
3
1, 0, 0 has
value 1
Figure 3.4: The Version Graph Upon Seeing (1, 0, 1) and (1, 0, 0)
Boundary sets are important because they provide an alternative to repre
senting the entire version space explicitly, which would be impractical. Given
only the boundary sets, it is possible to determine whether or not any hypoth
esis (in the prescribed class of Boolean functions we are using) is a member or
not of the version space. This determination is possible because of the fact that
any member of the version space (that is not a member of one of the boundary
sets) is more speciﬁc than some member of the general boundary set and is more
general than some member of the speciﬁc boundary set.
If we limit our Boolean functions that can be in the version space to terms,
it is a simple matter to determine maximally general and maximally speciﬁc
functions (assuming that there is some term that is in the version space). A
maximally speciﬁc one corresponds to a subface of minimal dimension that
contains all the members of the training set labelled by a 1 and no members
labelled by a 0. A maximally general one corresponds to a subface of maximal
dimension that contains all the members of the training set labelled by a 1 and
no members labelled by a 0. Looking at Fig. 3.4, we see that the subface of
minimal dimension that contains (1, 0, 0) but does not contain (1, 0, 1) is just
the vertex (1, 0, 0) itself—corresponding to the function x
1
x
2
x
3
. The subface
32 CHAPTER 3. USING VERSION SPACES FOR LEARNING
of maximal dimension that contains (1, 0, 0) but does not contain (1, 0, 1) is
the bottom face of the cube—corresponding to the function x
3
. In Figs. 3.2
through 3.4 the sbs is always singular. Version spaces for terms always have
singular speciﬁc boundary sets. As seen in Fig. 3.3, however, the gbs of a
version space for terms need not be singular.
3.3 Learning as Search of a Version Space
[To be written. Relate to term learning algorithm presented in Chapter
Two. Also discuss bestﬁrst search methods. See Pat Langley’s example us
ing “pseudocells” of how to generate and eliminate hypotheses.]
Selecting a hypothesis from the version space can be thought of as a search
problem. One can start with a very general function and specialize it through
various specialization operators until one ﬁnds a function that is consistent (or
adequately so) with a set of training patterns. Such procedures are usually
called topdown methods. Or, one can start with a very special function and
generalize it—resulting in bottomup methods. We shall see instances of both
styles of learning in this book. Compare this view of topdown
versus bottomup with the
divideandconquer and the
covering (or AQ) methods of
decisiontree induction.
3.4 The Candidate Elimination Method
The candidate elimination method, is an incremental method for computing the
boundary sets. Quoting from [Hirsh, 1994, page 6]:
“The candidateelimination algorithm manipulates the boundaryset
representation of a version space to create boundary sets that rep
resent a new version space consistent with all the previous instances
plus the new one. For a positive exmple the algorithm generalizes
the elements of the [sbs] as little as possible so that they cover the
new instance yet remain consistent with past data, and removes
those elements of the [gbs] that do not cover the new instance. For
a negative instance the algorithm specializes elements of the [gbs]
so that they no longer cover the new instance yet remain consis
tent with past data, and removes from the [sbs] those elements that
mistakenly cover the new, negative instance.”
The method uses the following deﬁnitions (adapted from
[Genesereth & Nilsson, 1987]):
• a hypothesis is called suﬃcient if and only if it has value 1 for all training
samples labeled by a 1,
• a hypothesis is called necessary if and only if it has value 0 for all training
samples labeled by a 0.
3.4. THE CANDIDATE ELIMINATION METHOD 33
Here is how to think about these deﬁnitions: A hypothesis implements a suﬃ
cient condition that a training sample has value 1 if the hypothesis has value 1
for all of the positive instances; a hypothesis implements a necessary condition
that a training sample has value 1 if the hypothesis has value 0 for all of the
negative instances. A hypothesis is consistent with the training set (and thus is
in the version space) if and only if it is both suﬃcient and necessary.
We start (before receiving any members of the training set) with the function
“0” as the singleton element of the speciﬁc boundary set and with the function
“1” as the singleton element of the general boundary set. Upon receiving a new
labeled input vector, the boundary sets are changed as follows:
a. If the new vector is labelled with a 1:
The new general boundary set is obtained from the previous one by ex
cluding any elements in it that are not suﬃcient. (That is, we exclude any
elements that have value 0 for the new vector.)
The new speciﬁc boundary set is obtained from the previous one by re
placing each element, h
i
, in it by all of its least generalizations.
The hypothesis h
g
is a least generalization of h if and only if: a) h is
more speciﬁc than h
g
, b) h
g
is suﬃcient, c) no function (including h) that
is more speciﬁc than h
g
is suﬃcient, and d) h
g
is more speciﬁc than some
member of the new general boundary set. It might be that h
g
= h. Also,
least generalizations of two diﬀerent functions in the speciﬁc boundary set
may be identical.
b. If the new vector is labelled with a 0:
The new speciﬁc boundary set is obtained from the previous one by ex
cluding any elements in it that are not necessary. (That is, we exclude
any elements that have value 1 for the new vector.)
The new general boundary set is obtained from the previous one by re
placing each element, h
i
, in it by all of its least specializations.
The hypothesis h
s
is a least specialization of h if and only if: a) h is more
general than h
s
, b) h
s
is necessary, c) no function (including h) that is
more general than h
s
is necessary, and d) h
s
is more general than some
member of the new speciﬁc boundary set. Again, it might be that h
s
= h,
and least specializations of two diﬀerent functions in the general boundary
set may be identical.
As an example, suppose we present the vectors in the following order:
vector label
(1, 0, 1) 0
(1, 0, 0) 1
(1, 1, 1) 0
(0, 0, 1) 0
34 CHAPTER 3. USING VERSION SPACES FOR LEARNING
We start with general boundary set, “1”, and speciﬁc boundary set, “0.”
After seeing the ﬁrst sample, (1, 0, 1), labeled with a 0, the speciﬁc boundary
set stays at “0” (it is necessary), and we change the general boundary set to
{x
1
, x
2
, x
3
}. Each of the functions, x
1
, x
2
, and x
3
, are least specializations of
“1” (they are necessary, “1” is not, they are more general than “0”, and there
are no functions that are more general than they and also necessary).
Then, after seeing (1, 0, 0), labeled with a 1, the general boundary set
changes to {x
3
} (because x
1
and x
2
are not suﬃcient), and the speciﬁc boundary
set is changed to {x
1
x
2
x
3
}. This single function is a least generalization of “0”
(it is suﬃcient, “0” is more speciﬁc than it, no function (including “0”) that is
more speciﬁc than it is suﬃcient, and it is more speciﬁc than some member of
the general boundary set.
When we see (1, 1, 1), labeled with a 0, we do not change the speciﬁc
boundary set because its function is still necessary. We do not change the
general boundary set either because x
3
is still necessary.
Finally, when we see (0, 0, 1), labeled with a 0, we do not change the speciﬁc
boundary set because its function is still necessary. We do not change the general
boundary set either because x
3
is still necessary. Maybe I’ll put in an example of a
version graph for nonBoolean
functions.
3.5 Bibliographical and Historical Remarks
The concept of version spaces and their role in learning was ﬁrst investigated
by Tom Mitchell [Mitchell, 1982]. Although these ideas are not used in prac
tical machine learning procedures, they do provide insight into the nature of
hypothesis selection. In order to accomodate noisy data, version spaces have
been generalized by [Hirsh, 1994] to allow hypotheses that are not necessarily
consistent with the training set. More to be added.
Chapter 4
Neural Networks
In chapter two we deﬁned several important subsets of Boolean functions. Sup
pose we decide to use one of these subsets as a hypothesis set for supervised
function learning. We next have the question of how best to implement the
function as a device that gives the outputs prescribed by the function for arbi
trary inputs. In this chapter we describe how networks of nonlinear elements
can be used to implement various inputoutput functions and how they can be
trained using supervised learning methods.
Networks of nonlinear elements, interconnected through adjustable weights,
play a prominent role in machine learning. They are called neural networks be
cause the nonlinear elements have as their inputs a weighted sum of the outputs
of other elements—much like networks of biological neurons do. These networks
commonly use the threshold element which we encountered in chapter two in
our study of linearly separable Boolean functions. We begin our treatment of
neural nets by studying this threshold element and how it can be used in the
simplest of all networks, namely ones composed of a single threshold element.
4.1 Threshold Logic Units
4.1.1 Deﬁnitions and Geometry
Linearly separable (threshold) functions are implemented in a straightforward
way by summing the weighted inputs and comparing this sum to a threshold
value as shown in Fig. 4.1. This structure we call a threshold logic unit (TLU).
Its output is 1 or 0 depending on whether or not the weighted sum of its inputs is
greater than or equal to a threshold value, θ. It has also been called an Adaline
(for adaptive linear element) [Widrow, 1962, Widrow & Lehr, 1990], an LTU
(linear threshold unit), a perceptron, and a neuron. (Although the word “per
ceptron” is often used nowadays to refer to a single TLU, Rosenblatt originally
deﬁned it as a class of networks of threshold elements [Rosenblatt, 1958].)
35
36 CHAPTER 4. NEURAL NETWORKS
!
x
1
x
2
x
n+1
= 1
x
i
w
1
w
2
w
n+1
w
i
w
n
X
threshold weight
x
n
W
threshold " = 0
f
f = thresh( ! w
i
x
i
, 0)
i = 1
n+1
Figure 4.1: A Threshold Logic Unit (TLU)
The ndimensional feature or input vector is denoted by X = (x
1
, . . . , x
n
).
When we want to distinguish among diﬀerent feature vectors, we will attach
subscripts, such as X
i
. The components of X can be any realvalued numbers,
but we often specialize to the binary numbers 0 and 1. The weights of a TLU
are represented by an ndimensional weight vector, W = (w
1
, . . . , w
n
). Its
components are realvalued numbers (but we sometimes specialize to integers).
The TLU has output 1 if
¸
n
i=1
x
i
w
i
≥ θ; otherwise it has output 0. The
weighted sum that is calculated by the TLU can be simply represented as a
vector dot product, X•W. (If the pattern and weight vectors are thought of as
“column” vectors, this dot product is then sometimes written as X
t
W, where
the “row” vector X
t
is the transpose of X.) Often, the threshold, θ, of the TLU
is ﬁxed at 0; in that case, arbitrary thresholds are achieved by using (n + 1)
dimensional “augmented” vectors, Y, and V, whose ﬁrst n components are the
same as those of X and W, respectively. The (n + 1)st component, x
n+1
, of
the augmented feature vector, Y, always has value 1; the (n+1)st component,
w
n+1
, of the augmented weight vector, V, is set equal to the negative of the
desired threshold value. (When we want to emphasize the use of augmented
vectors, we’ll use the Y,V notation; however when the context of the discussion
makes it clear about what sort of vectors we are talking about, we’ll lapse back
into the more familiar X,W notation.) In the Y,V notation, the TLU has an
output of 1 if Y•V ≥ 0. Otherwise, the output is 0.
We can give an intuitively useful geometric description of a TLU. A TLU
divides the input space by a hyperplane as sketched in Fig. 4.2. The hyperplane
is the boundary between patterns for which X•W + w
n+1
> 0 and patterns
for which X•W + w
n+1
< 0. Thus, the equation of the hyperplane itself is
X•W+w
n+1
= 0. The unit vector that is normal to the hyperplane is n =
W
W
,
where W =
(w
2
1
+. . . +w
2
n
) is the length of the vector W. (The normal
4.1. THRESHOLD LOGIC UNITS 37
form of the hyperplane equation is X•n +
W
W
= 0.) The distance from the
hyperplane to the origin is
wn+1
W
, and the distance from an arbitrary point, X,
to the hyperplane is
X•W+wn+1
W
. When the distance from the hyperplane to the
origin is negative (that is, when w
n+1
< 0), then the origin is on the negative
side of the hyperplane (that is, the side for which X•W+w
n+1
< 0).
X
.
W + w
n+1
> 0
on this side
W
X
W
n =
W
W
Origin
Unit vector normal
to hyperplane
W + w
n+1
= 0 X
n + = 0 X
Equations of hyperplane:
w
n+1
W
w
n+1
W + w
n+1
X
X
.
W + w
n+1
< 0
on this side
Figure 4.2: TLU Geometry
Adjusting the weight vector, W, changes the orientation of the hyperplane;
adjusting w
n+1
changes the position of the hyperplane (relative to the origin).
Thus, training of a TLU can be achieved by adjusting the values of the weights.
In this way the hyperplane can be moved so that the TLU implements diﬀerent
(linearly separable) functions of the input.
4.1.2 Special Cases of Linearly Separable Functions
Terms
Any term of size k can be implemented by a TLU with a weight from each of
those inputs corresponding to variables occurring in the term. A weight of +1 is
used from an input corresponding to a positive literal, and a weight of −1 is used
from an input corresponding to a negative literal. (Literals not mentioned in
the term have weights of zero—that is, no connection at all—from their inputs.)
The threshold, θ, is set equal to k
p
− 1/2, where k
p
is the number of positive
literals in the term. Such a TLU implements a hyperplane boundary that is
38 CHAPTER 4. NEURAL NETWORKS
parallel to a subface of dimension (n − k) of the unit hypercube. We show a
threedimensional example in Fig. 4.3. Thus, linearly separable functions are a
superset of terms.
(1,1,1)
(1,1,0)
x
2
x
1
x
3
f = x
1
x
2
x
1
+ x
2
 3/2 = 0
Equation of plane is:
Figure 4.3: Implementing a Term
Clauses
The negation of a clause is a term. For example, the negation of the clause
f = x
1
+ x
2
+ x
3
is the term f = x
1
x
2
x
3
. A hyperplane can be used to
implement this term. If we “invert” the hyperplane, it will implement the
clause instead. Inverting a hyperplane is done by multiplying all of the TLU
weights—even w
n+1
—by −1. This process simply changes the orientation of the
hyperplane—ﬂipping it around by 180 degrees and thus changing its “positive
side.” Therefore, linearly separable functions are also a superset of clauses. We
show an example in Fig. 4.4.
4.1.3 ErrorCorrection Training of a TLU
There are several procedures that have been proposed for adjusting the weights
of a TLU. We present next a family of incremental training procedures with
parameter c. These methods make adjustments to the weight vector only when
the TLU being trained makes an error on a training pattern; they are called
errorcorrection procedures. We use augmented feature and weight vectors in
describing them.
a. We start with a ﬁnite training set, Ξ, of vectors, Y
i
, and their binary
labels.
4.1. THRESHOLD LOGIC UNITS 39
f = x
1
+ x
2
+ x
3
x
1
x
1
+ x
2
+ x
3
< 1/2 = 0
f = x
1
x
2
x
3
Equation of plane is:
x
2
x
3
Figure 4.4: Implementing a Clause
b. Compose an inﬁnite training sequence, Σ, of vectors from Ξ and their
labels such that each member of Ξ occurs inﬁnitely often in Σ. Set the
initial weight values of an TLU to arbitrary values.
c. Repeat forever:
Present the next vector, Y
i
, in Σ to the TLU and note its response.
(a) If the TLU responds correctly, make no change in the weight vector.
(b) If Y
i
is supposed to produce an output of 0 and produces an output
of 1 instead, modify the weight vector as follows:
V ←− V−c
i
Y
i
where c
i
is a positive real number called the learning rate parame
ter (whose value is diﬀererent in diﬀerent instances of this family of
procedures and may depend on i).
Note that after this adjustment the new dot product will be (V −
c
i
Y
i
)•Y
i
= V•Y
i
−c
i
Y
i
•Y
i
, which is smaller than it was before the
weight adjustment.
(c) If Y
i
is supposed to produce an output of 1 and produces an output
of 0 instead, modify the weight vector as follows:
V ←− V+c
i
Y
i
In this case, the new dot product will be (V + c
i
Y
i
)•Y
i
= V•Y
i
+
c
i
Y
i
•Y
i
, which is larger than it was before the weight adjustment.
Note that all three of these cases can be combined in the following rule:
40 CHAPTER 4. NEURAL NETWORKS
V ←− V+c
i
(d
i
−f
i
)Y
i
where d
i
is the desired response (1 or 0) for Y
i
, and f
i
is the actual
response (1 or 0) for Y
i
.]
Note also that because the weight vector V now includes the w
n+1
thresh
old component, the threshold of the TLU is also changed by these adjust
ments.
We identify two versions of this procedure:
1) In the ﬁxedincrement procedure, the learning rate parameter, c
i
, is the
same ﬁxed, positive constant for all i. Depending on the value of this constant,
the weight adjustment may or may not correct the response to an erroneously
classiﬁed feature vector.
2) In the fractionalcorrection procedure, the parameter c
i
is set to λ
Y
i
•V
Y
i
•Y
i
,
where V is the weight vector before it is changed. Note that if λ = 0, no
correction takes place at all. If λ = 1, the correction is just suﬃcient to make
Y
i
•V = 0. If λ > 1, the error will be corrected.
It can be proved that if there is some weight vector, V, that produces a
correct output for all of the feature vectors in Ξ, then after a ﬁnite number
of feature vector presentations, the ﬁxedincrement procedure will ﬁnd such a
weight vector and thus make no more weight changes. The same result holds
for the fractionalcorrection procedure if 1 < λ ≤ 2.
For additional background, proofs, and examples of errorcorrection proce
dures, see [Nilsson, 1990]. See [Maass & Tur´an, 1994] for a
hyperplaneﬁnding procedure that
makes no more than O(n
2
log n)
mistakes.
4.1.4 Weight Space
We can give an intuitive idea about how these procedures work by considering
what happens to the augmented weight vector in “weight space” as corrections
are made. We use augmented vectors in our discussion here so that the threshold
function compares the dot product, Y
i
•V, against a threshold of 0. A particular
weight vector, V, then corresponds to a point in (n + 1)dimensional weight
space. Now, for any pattern vector, Y
i
, consider the locus of all points in
weight space corresponding to weight vectors yielding Y
i
•V = 0. This locus is
a hyperplane passing through the origin of the (n+1)dimensional space. Each
pattern vector will have such a hyperplane corresponding to it. Weight points
in one of the halfspaces deﬁned by this hyperplane will cause the corresponding
pattern to yield a dot product less than 0, and weight points in the other half
space will cause the corresponding pattern to yield a dot product greater than
0.
We show a schematic representation of such a weight space in Fig. 4.5.
There are four pattern hyperplanes, 1, 2, 3, 4 , corresponding to patterns Y
1
,
4.1. THRESHOLD LOGIC UNITS 41
Y
2
, Y
3
, Y
4
, respectively, and we indicate by an arrow the halfspace for each
in which weight vectors give dot products greater than 0. Suppose we wanted
weight values that would give positive responses for patterns Y
1
, Y
3
, and Y
4
,
and a negative response for pattern Y
2
. The weight point, V, indicated in the
ﬁgure is one such set of weight values.
2
3
4
1
V
Figure 4.5: Weight Space
The question of whether or not there exists a weight vector that gives desired
responses for a given set of patterns can be given a geometric interpretation. To
do so involves reversing the “polarity” of those hyperplanes corresponding to
patterns for which a negative response is desired. If we do that for our example
above, we get the weight space diagram shown in Fig. 4.6.
2
3
4
1
V
0
1
1
2
3
2
3
4
Figure 4.6: Solution Region in Weight Space
42 CHAPTER 4. NEURAL NETWORKS
If a weight vector exists that correctly classiﬁes a set of patterns, then the
halfspaces deﬁned by the correct responses for these patterns will have a non
empty intersection, called the solution region. The solution region will be a
“hyperwedge” region whose vertex is at the origin of weight space and whose
crosssection increases with increasing distance from the origin. This region
is shown shaded in Fig. 4.6. (The boxed numbers show, for later purposes,
the number of errors made by weight vectors in each of the regions.) The
ﬁxedincrement errorcorrection procedure changes a weight vector by moving it
normal to any pattern hyperplane for which that weight vector gives an incorrect
response. Suppose in our example that we present the patterns in the sequence
Y
1
, Y
2
, Y
3
, Y
4
, and start the process with a weight point V
1
, as shown in Fig.
4.7. Starting at V
1
, we see that it gives an incorrect response for pattern Y
1
, so
we move V
1
to V
2
in a direction normal to plane 1. (That is what adding Y
1
to
V
1
does.) Y
2
gives an incorrect response for pattern Y
2
, and so on. Ultimately,
the responses are only incorrect for planes bounding the solution region. Some
of the subsequent corrections may overshoot the solution region, but eventually
we work our way out far enough in the solution region that corrections (for
a ﬁxed increment size) take us within it. The proofs for convergence of the
ﬁxedincrement rule make this intuitive argument precise.
2
3
4
1
V
V
1
V
2
V
3
V
4
V
5
V
6
Figure 4.7: Moving Into the Solution Region
4.1.5 The WidrowHoﬀ Procedure
The WidrowHoﬀ procedure (also called the LMS or the delta procedure) at
tempts to ﬁnd weights that minimize a squarederror function between the pat
tern labels and the dot product computed by a TLU. For this purpose, the
pattern labels are assumed to be either +1 or −1 (instead of 1 or 0). The
4.1. THRESHOLD LOGIC UNITS 43
squared error for a pattern, X
i
, with label d
i
(for desired output) is:
ε
i
= (d
i
−
n+1
¸
j=1
x
ij
w
j
)
2
where x
ij
is the jth component of X
i
. The total squared error (over all patterns
in a training set, Ξ, containing m patterns) is then:
ε =
m
¸
i=1
(d
i
−
n+1
¸
j=1
x
ij
w
j
)
2
We want to choose the weights w
j
to minimize this squared error. One way to
ﬁnd such a set of weights is to start with an arbitrary weight vector and move it
along the negative gradient of ε as a function of the weights. Since ε is quadratic
in the w
j
, we know that it has a global minimum, and thus this steepest descent
procedure is guaranteed to ﬁnd the minimum. Each component of the gradient
is the partial derivative of ε with respect to one of the weights. One problem
with taking the partial derivative of ε is that ε depends on all the input vectors
in Ξ. Often, it is preferable to use an incremental procedure in which we try the
TLU on just one element, X
i
, of Ξ at a time, compute the gradient of the single
pattern squared error, ε
i
, make the appropriate adjustment to the weights, and
then try another member of Ξ. Of course, the results of the incremental version
can only approximate those of the batch one, but the approximation is usually
quite eﬀective. We will be describing the incremental version here.
The jth component of the gradient of the singlepattern error is:
∂ε
i
∂w
j
= −2(d
i
−
n+1
¸
j=1
x
ij
w
j
)x
ij
An adjustment in the direction of the negative gradient would then change each
weight as follows:
w
j
←− w
j
+c
i
(d
i
−f
i
)x
ij
where f
i
=
¸
n+1
j=1
x
ij
w
j
, and c
i
governs the size of the adjustment. The entire
weight vector (in augmented, or V, notation) is thus adjusted according to the
following rule:
V ←− V+c
i
(d
i
−f
i
)Y
i
where, as before, Y
i
is the ith augmented pattern vector.
The WidrowHoﬀ procedure makes adjustments to the weight vector when
ever the dot product itself, Y
i
•V, does not equal the speciﬁed desired target
44 CHAPTER 4. NEURAL NETWORKS
value, d
i
(which is either 1 or −1). The learningrate factor, c
i
, might de
crease with time toward 0 to achieve asymptotic convergence. The Widrow
Hoﬀ formula for changing the weight vector has the same form as the standard
ﬁxedincrement errorcorrection formula. The only diﬀerence is that f
i
is the
thresholded response of the TLU in the errorcorrection case while it is the dot
product itself for the WidrowHoﬀ procedure.
Finding weight values that give the desired dot products corresponds to solv
ing a set of linear equalities, and the WidrowHoﬀ procedure can be interpreted
as a descent procedure that attempts to minimize the meansquarederror be
tween the actual and desired values of the dot product. (For more on Widrow
Hoﬀ and other related procedures, see [Duda & Hart, 1973, pp. 151ﬀ].) Examples of training curves for
TLU’s; performance on training
set; performance on test set;
cumulative number of corrections.
4.1.6 Training a TLU on NonLinearlySeparable Training
Sets
When the training set is not linearly separable (perhaps because of noise or
perhaps inherently), it may still be desired to ﬁnd a “best” separating hy
perplane. Typically, the errorcorrection procedures will not do well on non
linearlyseparable training sets because they will continue to attempt to correct
inevitable errors, and the hyperplane will never settle into an acceptable place.
Several methods have been proposed to deal with this case. First, we might
use the WidrowHoﬀ procedure, which (although it will not converge to zero
error on nonlinearly separable problems) will give us a weight vector that min
imizes the meansquarederror. A meansquarederror criterion often gives un
satisfactory results, however, because it prefers many small errors to a few large
ones. As an alternative, error correction with a continuous decrease toward zero
of the value of the learning rate constant, c, will result in ever decreasing changes
to the hyperplane. Duda [Duda, 1966] has suggested keeping track of the average
value of the weight vector during error correction and using this average to give a
separating hyperplane that performs reasonably well on nonlinearlyseparable
problems. Gallant [Gallant, 1986] proposed what he called the “pocket algo
rithm.” As described in [Hertz, Krogh, & Palmer, 1991, p. 160]:
. . . the pocket algorithm . . . consists simply in storing (or “putting
in your pocket”) the set of weights which has had the longest un
modiﬁed run of successes so far. The algorithm is stopped after some
chosen time t . . .
After stopping, the weights in the pocket are used as a set that should give a
small number of errors on the training set. Errorcorrection proceeds as usual
with the ordinary set of weights. Also see methods proposed by
[John, 1995] and by
[Marchand & Golea, 1993]. The
latter is claimed to outperform the
pocket algorithm.
4.2 Linear Machines
The natural generalization of a (twocategory) TLU to an Rcategory classiﬁer
is the structure, shown in Fig. 4.8, called a linear machine. Here, to use more
4.2. LINEAR MACHINES 45
familiar notation, the Ws and X are meant to be augmented vectors (with an
(n+1)st component). Such a structure is also sometimes called a “competitive”
net or a “winnertakeall” net. The output of the linear machine is one of
the numbers, {1, . . . , R}, corresponding to which dot product is largest. Note
that when R = 2, the linear machine reduces to a TLU with weight vector
W = (W
1
−W
2
).
X
W
1
W
R
. . .
Y
Y
ARGMAX
W
1
.
X
W
R
.
X
Figure 4.8: A Linear Machine
The diagram in Fig. 4.9 shows the character of the regions in a 2dimensional
space created by a linear machine for R = 5. In n dimensions, every pair of
regions is either separated by a section of a hyperplane or is nonadjacent.
R
1
R
3
R
4
R
5
X
.
W
4
* X
.
W
i
for i & 4
R
2
In this region:
Figure 4.9: Regions For a Linear Machine
To train a linear machine, there is a straightforward generalization of the
2category errorcorrection rule. Assemble the patterns in the training set into
a sequence as before.
a. If the machine classiﬁes a pattern correctly, no change is made to any of
46 CHAPTER 4. NEURAL NETWORKS
the weight vectors.
b. If the machine mistakenly classiﬁes a category u pattern, X
i
, in category
v (u = v), then:
W
u
←− W
u
+c
i
X
i
and
W
v
←− W
v
−c
i
X
i
and all other weight vectors are not changed.
This correction increases the value of the uth dot product and decreases the
value of the vth dot product. Just as in the 2category ﬁxed increment proce
dure, this procedure is guaranteed to terminate, for constant c
i
, if there exists
weight vectors that make correct separations of the training set. Note that when
R = 2, this procedure reduces to the ordinary TLU errorcorrection procedure.
A proof that this procedure terminates is given in [Nilsson, 1990, pp. 8890]
and in [Duda & Hart, 1973, pp. 174177].
4.3 Networks of TLUs
4.3.1 Motivation and Examples
Layered Networks
To classify correctly all of the patterns in nonlinearlyseparable training sets re
quires separating surfaces more complex than hyperplanes. One way to achieve
more complex surfaces is with networks of TLUs. Consider, for example, the 2
dimensional, even parity function, f = x
1
x
2
+x
1
x
2
. No single line through the
2dimensional square can separate the vertices (1,1) and (0,0) from the vertices
(1,0) and (0,1)—the function is not linearly separable and thus cannot be im
plemented by a single TLU. But, the network of three TLUs shown in Fig. 4.10
does implement this function. In the ﬁgure, we show the weight values along
input lines to each TLU and the threshold value inside the circle representing
the TLU.
The function implemented by a network of TLUs depends on its topology
as well as on the weights of the individual TLUs. Feedforward networks have
no cycles; in a feedforward network no TLU’s input depends (through zero
or more intermediate TLUs) on that TLU’s output. (Networks that are not
feedforward are called recurrent networks). If the TLUs of a feedforward network
are arranged in layers, with the elements of layer j receiving inputs only from
TLUs in layer j − 1, then we say that the network is a layered, feedforward
4.3. NETWORKS OF TLUS 47
f
x
1
x
2
1.5
0.5
0.5
1
1
1
1
1
1
Figure 4.10: A Network for the Even Parity Function
network. The network shown in Fig. 4.10 is a layered, feedforward network
having two layers (of weights). (Some people count the layers of TLUs and
include the inputs as a layer also; they would call this network a threelayer
network.) In general, a feedforward, layered network has the structure shown
in Fig. 4.11. All of the TLUs except the “output” units are called hidden units
(they are “hidden” from the output).
X
hidden units
output units
Figure 4.11: A Layered, Feedforward Network
Implementing DNF Functions by TwoLayer Networks
We have already deﬁned kterm DNF functions—they are DNF functions having
k terms. A kterm DNF function can be implemented by a twolayer network
with k units in the hidden layer—to implement the k terms—and one output
unit to implement the disjunction of these terms. Since any Boolean function
has a DNF form, any Boolean function can be implemented by some twolayer
network of TLUs. As an example, consider the function f = x
1
x
2
+ x
2
x
3
+
x
1
x
3
. The form of the network that implements this function is shown in Fig.
4.12. (We leave it to the reader to calculate appropriate values of weights and
48 CHAPTER 4. NEURAL NETWORKS
thresholds.) The 3cube representation of the function is shown in Fig. 4.13.
The network of Fig. 4.12 can be designed so that each hidden unit implements
one of the planar boundaries shown in Fig. 4.13.
x
conjuncts
disjunct
A Feedforward, 2layer Network
TLUs
disjunction
of terms
conjunctions
of literals
(terms)
Figure 4.12: A TwoLayer Network
x
2
x
1
x
3
f = x
1
x
2
+ x
2
x
3
+ x
1
x
3
Figure 4.13: Three Planes Implemented by the Hidden Units
To train a twolayer network that implements a kterm DNF function, we
ﬁrst note that the output unit implements a disjunction, so the weights in the
ﬁnal layer are ﬁxed. The weights in the ﬁrst layer (except for the “threshold
weights”) can all have values of 1, −1, or 0. Later, we will present a training
procedure for this ﬁrst layer of weights. Discuss halfspace intersections,
halfspace unions, NPhardness of
optimal versions,
singlesideerrorhypeplane
methods, relation to “AQ”
methods.
4.3. NETWORKS OF TLUS 49
Important Comment About Layered Networks
Adding additional layers cannot compensate for an inadequate ﬁrst layer of
TLUs. The ﬁrst layer of TLUs partitions the feature space so that no two dif
ferently labeled vectors are in the same region (that is, so that no two such
vectors yield the same set of outputs of the ﬁrstlayer units). If the ﬁrst layer
does not partition the feature space in this way, then regardless of what subse
quent layers do, the ﬁnal outputs will not be consistent with the labeled training
set. Add diagrams showing the
nonlinear transformation
performed by a layered network.
4.3.2 Madalines
TwoCategory Networks
An interesting example of a layered, feedforward network is the twolayer one
which has an odd number of hidden units, and a “votetaking” TLU as the
output unit. Such a network was called a “Madaline” (for many adalines by
Widrow. Typically, the response of the vote taking unit is deﬁned to be the
response of the majority of the hidden units, although other output logics are
possible. Ridgway [Ridgway, 1962] proposed the following errorcorrection rule
for adjusting the weights of the hidden units of a Madaline:
• If the Madaline correctly classiﬁes a pattern, X
i
, no corrections are made
to any of the hidden units’ weight vectors,
• If the Madaline incorrectly classiﬁes a pattern, X
i
, then determine the
minimum number of hidden units whose responses need to be changed
(from 0 to 1 or from 1 to 0—depending on the type of error) in order that
the Madaline would correctly classify X
i
. Suppose that minimum number
is k
i
. Of those hidden units voting incorrectly, change the weight vectors
of those k
i
of them whose dot products are closest to 0 by using the error
correction rule:
W←− W+c
i
(d
i
−f
i
)X
i
where d
i
is the desired response of the hidden unit (0 or 1) and f
i
is the
actual response (0 or 1). (We assume augmented vectors here even though
we are using X, W notation.)
That is, we perform errorcorrection on just enough hidden units to correct
the vote to a majority voting correctly, and we change those that are easiest to
change. There are example problems in which even though a set of weight values
exists for a given Madaline structure such that it could classify all members of
a training set correctly, this procedure will fail to ﬁnd them. Nevertheless, the
procedure works eﬀectively in most experiments with it.
We leave it to the reader to think about how this training procedure could
be modiﬁed if the output TLU implemented an or function (or an and function).
50 CHAPTER 4. NEURAL NETWORKS
RCategory Madalines and ErrorCorrecting Output Codes
If there are k hidden units (k > 1) in a twolayer network, their responses
correspond to vertices of a kdimensional hypercube. The ordinary twocategory
Madaline identiﬁes two special points in this space, namely the vertex consisting
of k 1’s and the vertex consisting of k 0’s. The Madaline’s response is 1 if the
point in “hiddenunitspace” is closer to the all 1’s vertex than it is to the all
0’s vertex. We could design an Rcategory Madaline by identifying R vertices
in hiddenunit space and then classifying a pattern according to which of these
vertices the hiddenunit response is closest to. A machine using that idea was
implemented in the early 1960s at SRI [Brain, et al., 1962]. It used the fact
that the 2
p
socalled maximallength shiftregister sequences [Peterson, 1961, pp.
147ﬀ] in a (2
p
−1)dimensional Boolean space are mutually equidistant (for any
integer p). For similar, more recent work see [Dietterich & Bakiri, 1991].
4.3.3 Piecewise Linear Machines
A twocategory training set is linearly separable if there exists a threshold func
tion that correctly classiﬁes all members of the training set. Similarly, we can
say that an Rcategory training set is linearly separable if there exists a linear
machine that correctly classiﬁes all members of the training set. When an R
category problem is not linearly separable, we need a more powerful classiﬁer.
A candidate is a structure called a piecewise linear (PWL) machine illustrated
in Fig. 4.14.
X
W
1
W
1
. . .
Y
Y
MAX
. . .
Y
Y
MAX
. . .
W
R
W
R
ARG
MAX
1
R
1
N
1
1
N
R
Figure 4.14: A Piecewise Linear Machine
4.3. NETWORKS OF TLUS 51
The PWL machine groups its weighted summing units into R banks corre
sponding to the R categories. An input vector X is assigned to that category
corresponding to the bank with the largest weighted sum. We can use an error
correction training algorithm similar to that used for a linear machine. If a
pattern is classiﬁed incorrectly, we subtract (a constant times) the pattern vec
tor from the weight vector producing the largest dot product (it was incorrectly
the largest) and add (a constant times) the pattern vector to that weight vector
in the correct bank of weight vectors whose dot product is locally largest in
that bank. (Again, we use augmented vectors here.) Unfortunately, there are
example training sets that are separable by a given PWL machine structure
but for which this errorcorrection training method fails to ﬁnd a solution. The
method does appear to work well in some situations [Duda & Fossum, 1966], al
though [Nilsson, 1965, page 89] observed that “it is probably not a very eﬀective
method for training PWL machines having more than three [weight vectors] in
each bank.”
4.3.4 Cascade Networks
Another interesting class of feedforward networks is that in which all of the TLUs
are ordered and each TLU receives inputs from all of the pattern components
and from all TLUs lower in the ordering. Such a network is called a cascade
network. An example is shown in Fig. 4.15 in which the TLUs are labeled by
the linearly separable functions (of their inputs) that they implement. Each
TLU in the network implements a set of 2
k
parallel hyperplanes, where k is
the number of TLUs from which it receives inputs. (Each of the k preceding
TLUs can have an output of 1 or 0; that’s 2
k
diﬀerent combinations—resulting
in 2
k
diﬀerent positions for the parallel hyperplanes.) We show a 3dimensional
sketch for a network of two TLUs in Fig. 4.16. The reader might consider how
the ndimensional parity function might be implemented by a cascade network
having log
2
n TLUs.
x
L
1
L
2
output
L
3
Figure 4.15: A Cascade Network
52 CHAPTER 4. NEURAL NETWORKS
L
1
L
2
L
2
Figure 4.16: Planes Implemented by a Cascade Network with Two TLUs
Cascade networks might be trained by ﬁrst training L
1
to do as good a job
as possible at separating all the training patterns (perhaps by using the pocket
algorithm, for example), then training L
2
(including the weight from L
1
to L
2
)
also to do as good a job as possible at separating all the training patterns,
and so on until the resulting network classiﬁes the patterns in the training set
satisfactorily. Also mention the
“cascadecorrelation” method of
[Fahlman & Lebiere, 1990].
4.4 Training Feedforward Networks by Back
propagation
4.4.1 Notation
The general problem of training a network of TLUs is diﬃcult. Consider, for
example, the layered, feedforward network of Fig. 4.11. If such a network makes
an error on a pattern, there are usually several diﬀerent ways in which the error
can be corrected. It is diﬃcult to assign “blame” for the error to any particular
TLU in the network. Intuitively, one looks for weightadjusting procedures that
move the network in the correct direction (relative to the error) by making
minimal changes. In this spirit, the WidrowHoﬀ method of gradient descent
has been generalized to deal with multilayer networks.
In explaining this generalization, we use Fig. 4.17 to introduce some nota
tion. This network has only one output unit, but, of course, it is possible to have
several TLUs in the output layer—each implementing a diﬀerent function. Each
of the layers of TLUs will have outputs that we take to be the components of
vectors, just as the input features are components of an input vector. The jth
layer of TLUs (1 ≤ j < k) will have as their outputs the vector X
(j)
. The input
feature vector is denoted by X
(0)
, and the ﬁnal output (of the kth layer TLU)
is f. Each TLU in each layer has a weight vector (connecting it to its inputs)
and a threshold; the ith TLU in the jth layer has a weight vector denoted by
W
(j)
i
. (We will assume that the “threshold weight” is the last component of
the associated weight vector; we might have used V notation instead to include
4.4. TRAININGFEEDFORWARDNETWORKS BYBACKPROPAGATION53
this threshold component, but we have chosen here to use the familiar X,W
notation, assuming that these vectors are “augmented” as appropriate.) We
denote the weighted sum input to the ith threshold unit in the jth layer by
s
(j)
i
. (That is, s
(j)
i
= X
(j−1)
•W
(j)
i
.) The number of TLUs in the jth layer is
given by m
j
. The vector W
(j)
i
has components w
(j)
l,i
for l = 1, . . . , m
(j−1)
+ 1.
X
(0)
. . .
. . .
. . .
. . .
W
i
(1)
W
(k)
X
(1)
m
1
TLUs
. . .
W
i
(j)
. . .
X
(j)
. . .
W
i
(k1)
X
(k1)
m
j
TLUs m
(k1)
TLUs
w
li
(j)
w
l
(k)
First Layer jth Layer (k1)th Layer kth Layer
. . .
f
s
i
(1)
s
i
(j)
s
i
(k1)
s
(k)
Figure 4.17: A klayer Network
4.4.2 The Backpropagation Method
A gradient descent method, similar to that used in the Widrow Hoﬀ method,
has been proposed by various authors for training a multilayer, feedforward
network. As before, we deﬁne an error function on the ﬁnal output of the
network and we adjust each weight in the network so as to minimize the error.
If we have a desired response, d
i
, for the ith input vector, X
i
, in the training
set, Ξ, we can compute the squared error over the entire training set to be:
ε =
¸
Xi Ξ
(d
i
−f
i
)
2
where f
i
is the actual response of the network for input X
i
. To do gradient
descent on this squared error, we adjust each weight in the network by an
amount proportional to the negative of the partial derivative of ε with respect
to that weight. Again, we use a singlepattern error function so that we can
use an incremental weight adjustment procedure. The squared error for a single
input vector, X, evoking an output of f when the desired output is d is:
54 CHAPTER 4. NEURAL NETWORKS
ε = (d −f)
2
It is convenient to take the partial derivatives of ε with respect to the various
weights in groups corresponding to the weight vectors. We deﬁne a partial
derivative of a quantity φ, say, with respect to a weight vector, W
(j)
i
, thus:
∂φ
∂W
(j)
i
def
=
¸
∂φ
∂w
(j)
1i
, . . . ,
∂φ
∂w
(j)
li
, . . . ,
∂φ
∂w
(j)
mj−1+1,i
¸
where w
(j)
li
is the lth component of W
(j)
i
. This vector partial derivative of φ is
called the gradient of φ with respect to W and is sometimes denoted by ∇
W
φ.
Since ε’s dependence on W
(j)
i
is entirely through s
(j)
i
, we can use the chain
rule to write:
∂ε
∂W
(j)
i
=
∂ε
∂s
(j)
i
∂s
(j)
i
∂W
(j)
i
Because s
(j)
i
= X
(j−1)
•W
(j)
i
,
∂s
(j)
i
∂W
(j)
i
= X
(j−1)
. Substituting yields:
∂ε
∂W
(j)
i
=
∂ε
∂s
(j)
i
X
(j−1)
Note that
∂ε
∂s
(j)
i
= −2(d −f)
∂f
∂s
(j)
i
. Thus,
∂ε
∂W
(j)
i
= −2(d −f)
∂f
∂s
(j)
i
X
(j−1)
The quantity (d−f)
∂f
∂s
(j)
i
plays an important role in our calculations; we shall
denote it by δ
(j)
i
. Each of the δ
(j)
i
’s tells us how sensitive the squared error of
the network output is to changes in the input to each threshold function. Since
we will be changing weight vectors in directions along their negative gradient,
our fundamental rule for weight changes throughout the network will be:
W
(j)
i
←W
(j)
i
+c
(j)
i
δ
(j)
i
X
(j−1)
where c
(j)
i
is the learning rate constant for this weight vector. (Usually, the
learning rate constants for all weight vectors in the network are the same.) We
see that this rule is quite similar to that used in the error correction procedure
4.4. TRAININGFEEDFORWARDNETWORKS BYBACKPROPAGATION55
for a single TLU. A weight vector is changed by the addition of a constant times
its vector of (unweighted) inputs.
Now, we must turn our attention to the calculation of the δ
(j)
i
’s. Using the
deﬁnition, we have:
δ
(j)
i
= (d −f)
∂f
∂s
(j)
i
We have a problem, however, in attempting to carry out the partial deriva
tives of f with respect to the s’s. The network output, f, is not continuously
diﬀerentiable with respect to the s’s because of the presence of the threshold
functions. Most small changes in these sums do not change f at all, and when
f does change, it changes abruptly from 1 to 0 or vice versa.
A way around this diﬃculty was proposed by Werbos [Werbos, 1974] and
(perhaps independently) pursued by several other researchers, for example
[Rumelhart, Hinton, & Williams, 1986]. The trick involves replacing all the
threshold functions by diﬀerentiable functions called sigmoids.
1
The output
of a sigmoid function, superimposed on that of a threshold function, is shown
in Fig. 4.18. Usually, the sigmoid function used is f(s) =
1
1+e
−s
, where s is
the input and f is the output.
sigmoid
threshold function
f (s)
s
f (s) = 1/[1 + e
<s
]
Figure 4.18: A Sigmoid Function
1
[Russell & Norvig 1995, page 595] attributes the use of this idea to [Bryson & Ho 1969].
56 CHAPTER 4. NEURAL NETWORKS
We show the network containing sigmoid units in place of TLUs in Fig. 4.19.
The output of the ith sigmoid unit in the jth layer is denoted by f
(j)
i
. (That
is, f
(j)
i
=
1
1+e
−s
(j)
i
.)
X
(0)
. . .
. . .
. . .
. . .
W
i
(1)
s
i
(1)
W
(k)
X
(1)
f
i
(1)
m
1
sigmoids
. . .
W
i
(j)
f
i
(j)
s
i
(j)
. . .
X
(j)
. . .
W
i
(k1)
f
i
(k1)
s
i
(k1)
f
(k)
s
(k)
X
(k1)
m
j
sigmoids
m
(k1)
sigmoids
w
li
(j)
w
l
(k)
b
i
(j)
b
i
(1)
b
i
(k1)
b
(k)
First Layer jth Layer (k1)th Layer kth Layer
. . .
Figure 4.19: A Network with Sigmoid Units
4.4.3 Computing Weight Changes in the Final Layer
We ﬁrst calculate δ
(k)
in order to compute the weight change for the ﬁnal sigmoid
unit:
4.4. TRAININGFEEDFORWARDNETWORKS BYBACKPROPAGATION57
δ
(k)
= (d −f
(k)
)
∂f
(k)
∂s
(k)
Given the sigmoid function that we are using, namely f(s) =
1
1+e
−s
, we have
that
∂f
∂s
= f(1 −f). Substituting gives us:
δ
(k)
= (d −f
(k)
)f
(k)
(1 −f
(k)
)
Rewriting our general rule for weight vector changes, the weight vector in
the ﬁnal layer is changed according to the rule:
W
(k)
←W
(k)
+c
(k)
δ
(k)
X
(k−1)
where δ
(k)
= (d −f
(k)
)f
(k)
(1 −f
(k)
)
It is interesting to compare backpropagation to the errorcorrection rule and
to the WidrowHoﬀ rule. The backpropagation weight adjustment for the single
element in the ﬁnal layer can be written as:
W←− W+c(d −f)f(1 −f)X
Written in the same format, the errorcorrection rule is:
W←− W+c(d −f)X
and the WidrowHoﬀ rule is:
W←− W+c(d −f)X
The only diﬀerence (except for the fact that f is not thresholded in Widrow
Hoﬀ) is the f(1 − f) term due to the presence of the sigmoid function. With
the sigmoid function, f(1 − f) can vary in value from 0 to 1. When f is 0,
f(1 − f) is also 0; when f is 1, f(1 − f) is 0; f(1 − f) obtains its maximum
value of 1/4 when f is 1/2 (that is, when the input to the sigmoid is 0). The
sigmoid function can be thought of as implementing a “fuzzy” hyperplane. For
a pattern far away from this fuzzy hyperplane, f(1 − f) has value close to 0,
and the backpropagation rule makes little or no change to the weight values
regardless of the desired output. (Small changes in the weights will have little
eﬀect on the output for inputs far from the hyperplane.) Weight changes are
only made within the region of “fuzz” surrounding the hyperplane, and these
changes are in the direction of correcting the error, just as in the errorcorrection
and WidrowHoﬀ rules.
58 CHAPTER 4. NEURAL NETWORKS
4.4.4 Computing Changes to the Weights in Intermediate
Layers
Using our expression for the δ’s, we can similarly compute how to change each
of the weight vectors in the network. Recall:
δ
(j)
i
= (d −f)
∂f
∂s
(j)
i
Again we use a chain rule. The ﬁnal output, f, depends on s
(j)
i
through
each of the summed inputs to the sigmoids in the (j + 1)th layer. So:
δ
(j)
i
= (d −f)
∂f
∂s
(j)
i
= (d −f)
¸
∂f
∂s
(j+1)
1
∂s
(j+1)
1
∂s
(j)
i
+· · · +
∂f
∂s
(j+1)
l
∂s
(j+1)
l
∂s
(j)
i
+· · · +
∂f
∂s
(j+1)
mj+1
∂s
(j+1)
mj+1
∂s
(j)
i
¸
=
mj+1
¸
l=1
(d −f)
∂f
∂s
(j+1)
l
∂s
(j+1)
l
∂s
(j)
i
=
mj+1
¸
l=1
δ
(j+1)
l
∂s
(j+1)
l
∂s
(j)
i
It remains to compute the
∂s
(j+1)
l
∂s
(j)
i
’s. To do that we ﬁrst write:
s
(j+1)
l
= X
(j)
•W
(j+1)
l
=
mj+1
¸
ν=1
f
(j)
ν
w
(j+1)
νl
And then, since the weights do not depend on the s’s:
∂s
(j+1)
l
∂s
(j)
i
=
∂
¸
mj+1
ν=1
f
(j)
ν
w
(j+1)
νl
∂s
(j)
i
=
mj+1
¸
ν=1
w
(j+1)
νl
∂f
(j)
ν
∂s
(j)
i
Now, we note that
∂f
(j)
ν
∂s
(j)
i
= 0 unless ν = i, in which case
∂f
(j)
ν
∂s
(j)
ν
= f
(j)
ν
(1 − f
(j)
ν
).
Therefore:
∂s
(j+1)
l
∂s
(j)
i
= w
(j+1)
il
f
(j)
i
(1 −f
(j)
i
)
4.4. TRAININGFEEDFORWARDNETWORKS BYBACKPROPAGATION59
We use this result in our expression for δ
(j)
i
to give:
δ
(j)
i
= f
(j)
i
(1 −f
(j)
i
)
mj+1
¸
l=1
δ
(j+1)
l
w
(j+1)
il
The above equation is recursive in the δ’s. (It is interesting to note that
this expression is independent of the error function; the error function explicitly
aﬀects only the computation of δ
(k)
.) Having computed the δ
(j+1)
i
’s for layer
j + 1, we can use this equation to compute the δ
(j)
i
’s. The base case is δ
(k)
,
which we have already computed:
δ
(k)
= (d −f
(k)
)f
(k)
(1 −f
(k)
)
We use this expression for the δ’s in our generic weight changing rule, namely:
W
(j)
i
←W
(j)
i
+c
(j)
i
δ
(j)
i
X
(j−1)
Although this rule appears complex, it has an intuitively reasonable explanation.
The quantity δ
(k)
= (d −f)f(1 −f) controls the overall amount and sign of all
weight adjustments in the network. (Adjustments diminish as the ﬁnal output,
f, approaches either 0 or 1, because they have vanishing eﬀect on f then.) As
the recursion equation for the δ’s shows, the adjustments for the weights going
in to a sigmoid unit in the jth layer are proportional to the eﬀect that such
adjustments have on that sigmoid unit’s output (its f
(j)
(1−f
(j)
) factor). They
are also proportional to a kind of “average” eﬀect that any change in the output
of that sigmoid unit will have on the ﬁnal output. This average eﬀect depends
on the weights going out of the sigmoid unit in the jth layer (small weights
produce little downstream eﬀect) and the eﬀects that changes in the outputs of
(j +1)th layer sigmoid units will have on the ﬁnal output (as measured by the
δ
(j+1)
’s). These calculations can be simply implemented by “backpropagating”
the δ’s through the weights in reverse direction (thus, the name backprop for
this algorithm).
4.4.5 Variations on Backprop
[To be written: problem of local minima, simulated annealing, momemtum
(Plaut, et al., 1986, see [Hertz, Krogh, & Palmer, 1991]), quickprop, regulariza
tion methods]
Simulated Annealing
To apply simulated annealing, the value of the learning rate constant is gradually
decreased with time. If we fall early into an errorfunction valley that is not
very deep (a local minimum), it typically will neither be very broad, and soon
60 CHAPTER 4. NEURAL NETWORKS
a subsequent large correction will jostle us out of it. It is less likely that we will
move out of deep valleys, and at the end of the process (with very small values
of the learning rate constant), we descend to its deepest point. The process
gets its name by analogy with annealing in metallurgy, in which a material’s
temperature is gradually decreased allowing its crystalline structure to reach a
minimal energy state.
4.4.6 An Application: Steering a Van
A neural network system called ALVINN (Autonomous Land Vehicle in a Neural
Network) has been trained to steer a Chevy van successfully on ordinary roads
and highways at speeds of 55 mph [Pomerleau, 1991, Pomerleau, 1993]. The
input to the network is derived from a lowresolution (30 x 32) television image.
The TV camera is mounted on the van and looks at the road straight ahead.
This image is sampled and produces a stream of 960dimensional input vectors
to the neural network. The network is shown in Fig. 4.20.
960 inputs
30 x 32 retina
. . .
5 hidden
units connected
to all 960 inputs
30 output units
connected to all
hidden units
. . .
sharp left
sharp right
straight ahead
centroid
of outputs
steers
vehicle
Figure 4.20: The ALVINN Network
The network has ﬁve hidden units in its ﬁrst layer and 30 output units in the
second layer; all are sigmoid units. The output units are arranged in a linear
order and control the van’s steering angle. If a unit near the top of the array
of output units has a higher output than most of the other units, the van is
steered to the left; if a unit near the bottom of the array has a high output, the
van is steered to the right. The “centroid” of the responses of all of the output
4.5. SYNERGIES BETWEENNEURAL NETWORKANDKNOWLEDGEBASEDMETHODS61
units is computed, and the van’s steering angle is set at a corresponding value
between hard left and hard right.
The system is trained by a modiﬁed online training regime. A driver drives
the van, and his actual steering angles are taken as the correct labels for the
corresponding inputs. The network is trained incrementally by backprop to
produce the driverspeciﬁed steering angles in response to each visual pattern
as it occurs in real time while driving.
This simple procedure has been augmented to avoid two potential problems.
First, since the driver is usually driving well, the network would never get any
experience with farfromcenter vehicle positions and/or incorrect vehicle orien
tations. Also, on long, straight stretches of road, the network would be trained
for a long time only to produce straightahead steering angles; this training
would swamp out earlier training to follow a curved road. We wouldn’t want
to try to avoid these problems by instructing the driver to drive erratically
occasionally, because the system would learn to mimic this erratic behavior.
Instead, each original image is shifted and rotated in software to create 14
additional images in which the vehicle appears to be situated diﬀerently relative
to the road. Using a model that tells the system what steering angle ought to
be used for each of these shifted images, given the driverspeciﬁed steering angle
for the original image, the system constructs an additional 14 labeled training
patterns to add to those encountered during ordinary driver training.
4.5 Synergies Between Neural Network and
KnowledgeBased Methods
To be written; discuss
rulegenerating procedures (such as
[Towell & Shavlik, 1992]) and how
expertprovided rules can aid
neural net training and viceversa
[Towell, Shavlik, & Noordweier, 1990].
4.6 Bibliographical and Historical Remarks
To be added.
62 CHAPTER 4. NEURAL NETWORKS
Chapter 5
Statistical Learning
5.1 Using Statistical Decision Theory
5.1.1 Background and General Method
Suppose the pattern vector, X, is a random variable whose probability distri
bution for category 1 is diﬀerent than it is for category 2. (The treatment given
here can easily be generalized to Rcategory problems.) Speciﬁcally, suppose we
have the two probability distributions (perhaps probability density functions),
p(X  1) and p(X  2). Given a pattern, X, we want to use statistical tech
niques to determine its category—that is, to determine from which distribution
it was drawn. These techniques are based on the idea of minimizing the ex
pected value of a quantity similar to the error function we used in deriving the
weightchanging rules for backprop.
In developing a decision method, it is necessary to know the relative serious
ness of the two kinds of mistakes that might be made. (We might decide that a
pattern really in category 1 is in category 2, and vice versa.) We describe this
information by a loss function, λ(i  j), for i, j = 1, 2. λ(i  j) represents the loss
incurred when we decide a pattern is in category i when really it is in category
j. We assume here that λ(1  1) and λ(2  2) are both 0. For any given pattern,
X, we want to decide its category in such a way that minimizes the expected
value of this loss.
Given a pattern, X, if we decide category i, the expected value of the loss
will be:
L
X
(i) = λ(i  1)p(1  X) +λ(i  2)p(2  X)
where p(j  X) is the probability that given a pattern X, its category is j. Our
decision rule will be to decide that X belongs to category 1 if L
X
(1) ≤ L
X
(2),
and to decide on category 2 otherwise.
63
64 CHAPTER 5. STATISTICAL LEARNING
We can use Bayes’ Rule to get expressions for p(j  X) in terms of p(X  j),
which we assume to be known (or estimatible):
p(j  X) =
p(X  j)p(j)
p(X)
where p(j) is the (a priori) probability of category j (one category may be much
more probable than the other); and p(X) is the (a priori) probability of pattern
X being the pattern we are asked to classify. Performing the substitutions given
by Bayes’ Rule, our decision rule becomes:
Decide category 1 iﬀ:
λ(1  1)
p(X  1)p(1)
p(X)
+λ(1  2)
p(X  2)p(2)
p(X)
≤ λ(2  1)
p(X  1)p(1)
p(X)
+λ(2  2)
p(X  2)p(2)
p(X)
Using the fact that λ(i  i) = 0, and noticing that p(X) is common to both
expressions, we obtain,
Decide category 1 iﬀ:
λ(1  2)p(X  2)p(2) ≤ λ(2  1)p(X  1)p(1)
If λ(1  2) = λ(2  1) and if p(1) = p(2), then the decision becomes particu
larly simple:
Decide category 1 iﬀ:
p(X  2) ≤ p(X  1)
Since p(X  j) is called the likelihood of j with respect to X, this simple decision
rule implements what is called a maximumlikelihood decision. More generally,
if we deﬁne k(i  j) as λ(i  j)p(j), then our decision rule is simply,
Decide category1 iﬀ:
k(1  2)p(X  2) ≤ k(2  1)p(X  1)
In any case, we need to compare the (perhaps weighted) quantities p(X  i) for
i = 1 and 2. The exact decision rule depends on the the probability distributions
assumed. We will treat two interesting distributions.
5.1. USING STATISTICAL DECISION THEORY 65
5.1.2 Gaussian (or Normal) Distributions
The multivariate (ndimensional) Gaussian distribution is given by the proba
bility density function:
p(X) =
1
(2π)
n/2
Σ
1/2
e
−(X−M)
t
Σ
−1
(X−M)
2
where n is the dimension of the column vector X, the column vector M is called
the mean vector, (X − M)
t
is the transpose of the vector (X − M), Σ is the
covariance matrix of the distribution (an n × n symmetric, positive deﬁnite
matrix), Σ
−1
is the inverse of the covariance matrix, and Σ is the determinant
of the covariance matrix.
The mean vector, M, with components (m
1
, . . . , m
n
), is the expected value
of X (using this distribution); that is, M = E[X]. The components of the
covariance matrix are given by:
σ
2
ij
= E[(x
i
−m
i
)(x
j
−m
j
)]
In particular, σ
2
ii
is called the variance of x
i
.
Although the formula appears complex, an intuitive idea for Gaussian dis
tributions can be given when n = 2. We show a twodimensional Gaussian
distribution in Fig. 5.1. A threedimensional plot of the distribution is shown
at the top of the ﬁgure, and contours of equal probability are shown at the bot
tom. In this case, the covariance matrix, Σ, is such that the elliptical contours
of equal probability are skewed. If the covariance matrix were diagonal, that is
if all oﬀdiagonal terms were 0, then the major axes of the elliptical contours
would be aligned with the coordinate axes. In general the principal axes are
given by the eigenvectors of Σ. In any case, the equiprobability contours are
all centered on the mean vector, M, which in our ﬁgure happens to be at the
origin. In general, the formula in the exponent in the Gaussian distribution
is a positive deﬁnite quadratic form (that is, its value is always positive); thus
equiprobability contours are hyperellipsoids in ndimensional space.
Suppose we now assume that the two classes of pattern vectors that we
want to distinguish are each distributed according to a Gaussian distribution
but with diﬀerent means and covariance matrices. That is, one class tends to
have patterns clustered around one point in the ndimensional space, and the
other class tends to have patterns clustered around another point. We show a
twodimensional instance of this problem in Fig. 5.2. (In that ﬁgure, we have
plotted the sum of the two distributions.) What decision rule should we use to
separate patterns into the two appropriate categories?
Substituting the Gaussian distributions into our maximum likelihood for
mula yields:
66 CHAPTER 5. STATISTICAL LEARNING
5
0
5
5
0
5
0
0.25
0.5
0.75
1
5
0
5
5
0
5
0
25
.5
75
1
6 4 2 0 2 4 6
6
4
2
0
2
4
6
x
1
x
2
p(x
1
,x
2
)
2
4
6
2 4 6
x
1
x
2
Figure 5.1: The TwoDimensional Gaussian Distribution
Decide category 1 iﬀ:
1
(2π)
n/2
Σ
2

1/2
e
−1/2(X−M2)
t
Σ
−1
2
(X−M2)
is less than or equal to
1
(2π)
n/2
Σ
1

1/2
e
−1/2(X−M1)
t
Σ
−1
1
(X−M1)
where the category 1 patterns are distributed with mean and covariance M
1
and Σ
1
, respectively, and the category 2 patterns are distributed with mean
and covariance M
2
and Σ
2
.
The result of the comparison isn’t changed if we compare logarithms instead.
After some manipulation, our decision rule is then:
5.1. USING STATISTICAL DECISION THEORY 67
5
0
5
10
5
0
5
10
0
0.25
0.5
0.75
1
5
0
5
10
5
0
5
10
0
25
.5
75
1
x
1
x
2
p(x
1
,x
2
)
5 2.5 0 2.5 5 7.5 10
5
2.5
0
2.5
5
7.5
10
Figure 5.2: The Sum of Two Gaussian Distributions
Decide category 1 iﬀ:
(X−M
1
)
t
Σ
−1
1
(X−M
1
) < (X−M
2
)
t
Σ
−1
2
(X−M
2
) +B
where B, a constant bias term, incorporates the logarithms of the fractions
preceding the exponential, etc.
When the quadratic forms are multiplied out and represented in terms of
the components x
i
, the decision rule involves a quadric surface (a hyperquadric)
in ndimensional space. The exact shape and position of this hyperquadric is
determined by the means and the covariance matrices. The surface separates
the space into two parts, one of which contains points that will be assigned to
category 1 and the other contains points that will be assigned to category 2.
It is interesting to look at a special case of this surface. If the covariance
matrices for each category are identical and diagonal, with all σ
ii
equal to each
other, then the contours of equal probability for each of the two distributions
68 CHAPTER 5. STATISTICAL LEARNING
are hyperspherical. The quadric forms then become (1/Σ)(X−M
i
)
t
(X−M
i
),
and the decision rule is:
Decide category 1 iﬀ:
(X−M
1
)
t
(X−M
1
) < (X−M
2
)
t
(X−M
2
)
Multiplying out yields:
X•X−2X•M
1
+M
1
•M
1
< X•X−2X•M
2
+M
2
•M
2
or ﬁnally,
Decide category 1 iﬀ:
X•M
1
≥ X•M
2
+ Constant
or
X•(M
1
−M
2
) ≥ Constant
where the constant depends on the lengths of the mean vectors.
We see that the optimal decision surface in this special case is a hyperplane.
In fact, the hyperplane is perpendicular to the line joining the two means. The
weights in a TLU implementation are equal to the diﬀerence in the mean vectors.
If the parameters (M
i
, Σ
i
) of the probability distributions of the categories
are not known, there are various techniques for estimating them, and then using
those estimates in the decision rule. For example, if there are suﬃcient training
patterns, one can use sample means and sample covariance matrices. (Caution:
the sample covariance matrix will be singular if the training patterns happen to
lie on a subspace of the whole ndimensional space—as they certainly will, for
example, if the number of training patterns is less than n.)
5.1.3 Conditionally Independent Binary Components
Suppose the vector X is a random variable having binary (0,1) components.
We continue to denote the two probability distributions by p(X  1) and p(X 
2). Further suppose that the components of these vectors are conditionally
independent given the category. By conditional independence in this case, we
mean that the formulas for the distribution can be expanded as follows:
5.1. USING STATISTICAL DECISION THEORY 69
p(X  i) = p(x
1
 i)p(x
2
 i) · · · p(x
n
 i)
for i = 1, 2
Recall the minimumaverageloss decision rule,
Decide category 1 iﬀ:
λ(1  2)p(X  2)p(2) ≤ λ(2  1)p(X  1)p(1)
Assuming conditional independence of the components and that λ(1  2) = λ(2 
1), we obtain,
Decide category 1 iﬀ:
p(1)p(x
1
 1)p(x
2
 1) · · · p(x
n
 1) ≥ p(x
1
 2)p(x
2
 2) · · · p(x
n
 2)p(2)
or iﬀ:
p(x
1
 1)p(x
2
 1) . . . p(x
n
 1)
p(x
1
 2)p(x
2
 2) . . . p(x
n
 2)
≥
p(2)
p(1)
or iﬀ:
log
p(x
1
 1)
p(x
1
 2)
+ log
p(x
2
 1)
p(x
2
 2)
+· · · + log
p(x
n
 1)
p(x
n
 2)
+ log
p(1)
p(2)
≥ 0
Let us deﬁne values of the components of the distribution for speciﬁc values of
their arguments, x
i
:
p(x
i
= 1  1) = p
i
p(x
i
= 0  1) = 1 −p
i
p(x
i
= 1  2) = q
i
p(x
i
= 0  2) = 1 −q
i
Now, we note that since x
i
can only assume the values of 1 or 0:
log
p(x
i
 1)
p(x
i
 2)
= x
i
log
p
i
q
i
+ (1 −x
i
) log
(1 −p
i
)
(1 −q
i
)
70 CHAPTER 5. STATISTICAL LEARNING
= x
i
log
p
i
(1 −q
i
)
q
i
(1 −p
i
)
+ log
(1 −p
i
)
(1 −q
i
)
Substituting these expressions into our decision rule yields:
Decide category 1 iﬀ:
n
¸
i=1
x
i
log
p
i
(1 −q
i
)
q
i
(1 −p
i
)
+
n
¸
i=1
log
(1 −p
i
)
(1 −q
i
)
+ log
p(1)
p(2)
≥ 0
We see that we can achieve this decision with a TLU with weight values as
follows:
w
i
= log
p
i
(1 −q
i
)
q
i
(1 −p
i
)
for i = 1, . . . , n, and
w
n+1
= log
p(1)
1 −p(1)
+
n
¸
i=1
log
(1 −p
i
)
(1 −q
i
)
If we do not know the p
i
, q
i
and p(1), we can use a sample of labeled training
patterns to estimate these parameters.
5.2 Learning Belief Networks
To be added.
5.3 NearestNeighbor Methods
Another class of methods can be related to the statistical ones. These are called
nearestneighbor methods or, sometimes, memorybased methods. (A collection
of papers on this subject is in [Dasarathy, 1991].) Given a training set Ξ of m
labeled patterns, a nearestneighbor procedure decides that some new pattern,
X, belongs to the same category as do its closest neighbors in Ξ. More precisely,
a knearestneighbor method assigns a new pattern, X, to that category to which
the plurality of its k closest neighbors belong. Using relatively large values of
k decreases the chance that the decision will be unduly inﬂuenced by a noisy
training pattern close to X. But large values of k also reduce the acuity of the
method. The knearestneighbor method can be thought of as estimating the
values of the probabilities of the classes given X. Of course the denser are the
points around X, and the larger the value of k, the better the estimate.
5.3. NEARESTNEIGHBOR METHODS 71
The distance metric used in nearestneighbor methods (for numerical at
tributes) can be simple Euclidean distance. That is, the distance between two
patterns (x
11
, x
12
, . . . , x
1n
) and (x
21
, x
22
, . . . , x
2n
) is
¸
n
j=1
(x
1j
−x
2j
)
2
. This
distance measure is often modiﬁed by scaling the features so that the spread of
attribute values along each dimension is approximately the same. In that case,
the distance between the two vectors would be
¸
n
j=1
a
2
j
(x
1j
−x
2j
)
2
, where
a
j
is the scale factor for dimension j.
An example of a nearestneighbor decision problem is shown in Fig. 5.3. In
the ﬁgure the class of a training pattern is indicated by the number next to it.
k = 8
X (a pattern to be classified)
1
1
1
1
1
1 1
1
2
1
2
2
2
2
2
2
2
2
3
3
3
3
3
3
3
3
3
training pattern
class of training pattern
four patterns of category 1
two patterns of category 2
two patterns of category 3
plurality are in category 1, so
decide X is in category 1
Figure 5.3: An 8NearestNeighbor Decision
See [Baum, 1994] for theoretical
analysis of error rate as a function
of the number of training patterns
for the case in which points are
randomly distributed on the surface
of a unit sphere and underlying
function is linearly separable.
Nearestneighbor methods are memory intensive because a large number of
training patterns must be stored to achieve good generalization. Since memory
cost is now reasonably low, the method and its derivatives have seen several
practical applications. (See, for example, [Moore, 1992, Moore, et al., 1994].
Also, the distance calculations required to ﬁnd nearest neighbors can often be
eﬃciently computed by kdtree methods [Friedman, et al., 1977].
A theorem by Cover and Hart [Cover & Hart, 1967] relates the performance
of the 1nearestneighbor method to the performance of a minimumprobability
oferror classiﬁer. As mentioned earlier, the minimumprobabilityoferror clas
siﬁer would assign a new pattern Xto that category that maximized p(i)p(X  i),
where p(i) is the a priori probability of category i, and p(X  i) is the probability
(or probability density function) of X given that X belongs to category i, for
categories i = 1, . . . , R. Suppose the probability of error in classifying patterns
of such a minimumprobabilityoferror classiﬁer is ε. The CoverHart theo
rem states that under very mild conditions (having to do with the smoothness
72 CHAPTER 5. STATISTICAL LEARNING
of probability density functions) the probability of error, ε
nn
, of a 1nearest
neighbor classiﬁer is bounded by:
ε ≤ ε
nn
≤ ε
2 −ε
R
R −1
≤ 2ε
where R is the number of categories. Also see [Aha, 1991].
5.4 Bibliographical and Historical Remarks
To be added.
Chapter 6
Decision Trees
6.1 Deﬁnitions
A decision tree (generally deﬁned) is a tree whose internal nodes are tests (on
input patterns) and whose leaf nodes are categories (of patterns). We show an
example in Fig. 6.1. A decision tree assigns a class number (or output) to an
input pattern by ﬁltering the pattern down through the tests in the tree. Each
test has mutually exclusive and exhaustive outcomes. For example, test T
2
in
the tree of Fig. 6.1 has three outcomes; the leftmost one assigns the input
pattern to class 3, the middle one sends the input pattern down to test T
4
, and
the rightmost one assigns the pattern to class 1. We follow the usual convention
of depicting the leaf nodes by the class number.
1
Note that in discussing decision
trees we are not limited to implementing Boolean functions—they are useful for
general, categorically valued functions.
There are several dimensions along which decision trees might diﬀer:
a. The tests might be multivariate (testing on several features of the input
at once) or univariate (testing on only one of the features).
b. The tests might have two outcomes or more than two. (If all of the tests
have two outcomes, we have a binary decision tree.)
c. The features or attributes might be categorical or numeric. (Binaryvalued
ones can be regarded as either.)
1
One of the researchers who has done a lot of work on learning decision trees is Ross
Quinlan. Quinlan distinguishes between classes and categories. He calls the subsets of patterns
that ﬁlter down to each tip categories and subsets of patterns having the same label classes.
In Quinlan’s terminology, our example tree has nine categories and three classes. We will not
make this distinction, however, but will use the words “category” and “class” interchangeably
to refer to what Quinlan calls “class.”
73
74 CHAPTER 6. DECISION TREES
T
1
T
2
T
3
T
4
T
4
T
4
3
1
3
2
1
2 3
2
1
Figure 6.1: A Decision Tree
d. We might have two classes or more than two. If we have two classes and
binary inputs, the tree implements a Boolean function, and is called a
Boolean decision tree.
It is straightforward to represent the function implemented by a univariate
Boolean decision tree in DNF form. The DNF form implemented by such a tree
can be obtained by tracing down each path leading to a tip node corresponding
to an output value of 1, forming the conjunction of the tests along this path,
and then taking the disjunction of these conjunctions. We show an example in
Fig. 6.2. In drawing univariate decision trees, each nonleaf node is depicted by
a single attribute. If the attribute has value 0 in the input pattern, we branch
left; if it has value 1, we branch right.
The kDL class of Boolean functions can be implemented by a multivariate
decision tree having the (highly unbalanced) form shown in Fig. 6.3. Each test,
c
i
, is a term of size k or less. The v
i
all have values of 0 or 1.
6.2 Supervised Learning of Univariate Decision
Trees
Several systems for learning decision trees have been proposed. Prominent
among these are ID3 and its new version, C4.5 [Quinlan, 1986, Quinlan, 1993],
and CART [Breiman, et al., 1984] We discuss here only batch methods, al
though incremental ones have also been proposed [Utgoﬀ, 1989].
6.2. SUPERVISED LEARNING OF UNIVARIATE DECISION TREES 75
x
3
x
2
x
4
x
1
1
0
1
1
0
0
0
1
x
3
x
2
x
3
x
2
x
3
x
4
x
3
x
4
x
1 x
3
x
4
x
1
f = x
3
x
2
+ x
3
x
4
x
1
1
0
0
1
0
Figure 6.2: A Decision Tree Implementing a DNF Function
6.2.1 Selecting the Type of Test
As usual, we have n features or attributes. If the attributes are binary, the
tests are simply whether the attribute’s value is 0 or 1. If the attributes are
categorical, but nonbinary, the tests might be formed by dividing the attribute
values into mutually exclusive and exhaustive subsets. A decision tree with such
tests is shown in Fig. 6.4. If the attributes are numeric, the tests might involve
“interval tests,” for example 7 ≤ x
i
≤ 13.2.
6.2.2 Using Uncertainty Reduction to Select Tests
The main problem in learning decision trees for the binaryattribute case is
selecting the order of the tests. For categorical and numeric attributes, we
must also decide what the tests should be (besides selecting the order). Several
techniques have been tried; the most popular one is at each stage to select that
test that maximally reduces an entropylike measure.
We show how this technique works for the simple case of tests with binary
outcomes. Extension to multipleoutcome tests is straightforward computation
ally but gives poor results because entropy is always decreased by having more
outcomes.
The entropy or uncertainty still remaining about the class of a pattern—
knowing that it is in some set, Ξ, of patterns is deﬁned as:
H(Ξ) = −
¸
i
p(iΞ) log
2
p(iΞ)
76 CHAPTER 6. DECISION TREES
c
q
c
q1
c
i
1
v
n
v
n1
v
i
v
1
Figure 6.3: A Decision Tree Implementing a Decision List
where p(iΞ) is the probability that a pattern drawn at random from Ξ belongs
to class i, and the summation is over all of the classes. We want to select tests at
each node such that as we travel down the decision tree, the uncertainty about
the class of a pattern becomes less and less.
Since we do not in general have the probabilities p(iΞ), we estimate them by
sample statistics. Although these estimates might be errorful, they are never
theless useful in estimating uncertainties. Let ˆ p(iΞ) be the number of patterns
in Ξ belonging to class i divided by the total number of patterns in Ξ. Then an
estimate of the uncertainty is:
ˆ
H(Ξ) = −
¸
i
ˆ p(iΞ) log
2
ˆ p(iΞ)
For simplicity, from now on we’ll drop the “hats” and use sample statistics as
if they were real probabilities.
If we perform a test, T, having k possible outcomes on the patterns in Ξ, we
will create k subsets, Ξ
1
, Ξ
2
, . . . , Ξ
k
. Suppose that n
i
of the patterns in Ξ are in
Ξ
i
for i = 1, ..., k. (Some n
i
may be 0.) If we knew that T applied to a pattern
in Ξ resulted in the jth outcome (that is, we knew that the pattern was in Ξ
j
),
the uncertainty about its class would be:
H(Ξ
j
) = −
¸
i
p(iΞ
j
) log
2
p(iΞ
j
)
and the reduction in uncertainty (beyond knowing only that the pattern was in
Ξ) would be:
6.2. SUPERVISED LEARNING OF UNIVARIATE DECISION TREES 77
x
3
= a, b, c, or d
{a, c}
{b}
x
1
= e, b, or d
{e,b}
{d}
x
4
= a, e, f, or g
{a, g}
{e, f}
x
2
= a, or g
{a}
{g}
1
2
1
1 2
{d}
2
Figure 6.4: A Decision Tree with Categorical Attributes
H(Ξ) −H(Ξ
j
)
Of course we cannot say that the test T is guaranteed always to produce that
amount of reduction in uncertainty because we don’t know that the result of
the test will be the jth outcome. But we can estimate the average uncertainty
over all the Ξ
j
, by:
E[H
T
(Ξ)] =
¸
j
p(Ξ
j
)H(Ξ
j
)
where by H
T
(Ξ) we mean the average uncertainty after performing test T on
the patterns in Ξ, p(Ξ
j
) is the probability that the test has outcome j, and the
sum is taken from 1 to k. Again, we don’t know the probabilities p(Ξ
j
), but we
can use sample values. The estimate ˆ p(Ξ
j
) of p(Ξ
j
) is just the number of those
patterns in Ξ that have outcome j divided by the total number of patterns in
Ξ. The average reduction in uncertainty achieved by test T (applied to patterns
in Ξ) is then:
R
T
(Ξ) = H(Ξ) −E[H
T
(Ξ)]
An important family of decision tree learning algorithms selects for the root
of the tree that test that gives maximum reduction of uncertainty, and then
applies this criterion recursively until some termination condition is met (which
we shall discuss in more detail later). The uncertainty calculations are particu
larly simple when the tests have binary outcomes and when the attributes have
78 CHAPTER 6. DECISION TREES
binary values. We’ll give a simple example to illustrate how the test selection
mechanism works in that case.
Suppose we want to use the uncertaintyreduction method to build a decision
tree to classify the following patterns:
pattern class
(0, 0, 0) 0
(0, 0, 1) 0
(0, 1, 0) 0
(0, 1, 1) 0
(1, 0, 0) 0
(1, 0, 1) 1
(1, 1, 0) 0
(1, 1, 1) 1
What single test, x
1
, x
2
, or x
3
, should be performed ﬁrst? The illustration in
Fig. 6.5 gives geometric intuition about the problem.
x
1
x
2
x
3
The test x
1
Figure 6.5: Eight Patterns to be Classiﬁed by a Decision Tree
The initial uncertainty for the set, Ξ, containing all eight points is:
H(Ξ) = −(6/8) log
2
(6/8) −(2/8) log
2
(2/8) = 0.81
Next, we calculate the uncertainty reduction if we perform x
1
ﬁrst. The left
hand branch has only patterns belonging to class 0 (we call them the set Ξ
l
), and
the righthandbranch (Ξ
r
) has two patterns in each class. So, the uncertainty
of the lefthand branch is:
6.3. NETWORKS EQUIVALENT TO DECISION TREES 79
H
x1
(Ξ
l
) = −(4/4) log
2
(4/4) −(0/4) log
2
(0/4) = 0
And the uncertainty of the righthand branch is:
H
x1
(Ξ
r
) = −(2/4) log
2
(2/4) −(2/4) log
2
(2/4) = 1
Half of the patterns “go left” and half “go right” on test x
1
. Thus, the average
uncertainty after performing the x
1
test is:
1/2H
x1
(Ξ
l
) + 1/2H
x1
(Ξ
r
) = 0.5
Therefore the uncertainty reduction on Ξ achieved by x
1
is:
R
x1
(Ξ) = 0.81 −0.5 = 0.31
By similar calculations, we see that the test x
3
achieves exactly the same
uncertainty reduction, but x
2
achieves no reduction whatsoever. Thus, our
“greedy” algorithm for selecting a ﬁrst test would select either x
1
or x
3
. Suppose
x
1
is selected. The uncertaintyreduction procedure would select x
3
as the next
test. The decision tree that this procedure creates thus implements the Boolean
function: f = x
1
x
3
. See [Quinlan, 1986, sect. 4] for
another example.
6.2.3 NonBinary Attributes
If the attributes are nonbinary, we can still use the uncertaintyreduction tech
nique to select tests. But now, in addition to selecting an attribute, we must
select a test on that attribute. Suppose for example that the value of an at
tribute is a real number and that the test to be performed is to set a threshold
and to test to see if the number is greater than or less than that threshold. In
principle, given a set of labeled patterns, we can measure the uncertainty reduc
tion for each test that is achieved by every possible threshold (there are only
a ﬁnite number of thresholds that give diﬀerent test results if there are only
a ﬁnite number of training patterns). Similarly, if an attribute is categorical
(with a ﬁnite number of categories), there are only a ﬁnite number of mutually
exclusive and exhaustive subsets into which the values of the attribute can be
split. We can calculate the uncertainty reduction for each split.
6.3 Networks Equivalent to Decision Trees
Since univariate Boolean decision trees are implementations of DNF functions,
they are also equivalent to twolayer, feedforward neural networks. We show
an example in Fig. 6.6. The decision tree at the left of the ﬁgure implements
80 CHAPTER 6. DECISION TREES
the same function as the network at the right of the ﬁgure. Of course, when
implemented as a network, all of the features are evaluated in parallel for any
input pattern, whereas when implemented as a decision tree only those features
on the branch traveled down by the input pattern need to be evaluated. The
decisiontree induction methods discussed in this chapter can thus be thought of
as particular ways to establish the structure and the weight values for networks.
X
x
1
x
2
x
3
x
4
terms
1
+1
disjunction
x
3
x
2
x
3
x
4
x
1
+1
1
+1
f
1.5
0.5
x
3
x
2
x
4
x
1
1 0
1
1
0
0
0
1
x
3
x
2
x
3
x
2
x
3
x
4
x
3
x
4
x
1
x
3
x
4
x
1
f = x
3
x
2
+ x
3
x
4
x
1
1
0
0
1
0
Figure 6.6: A Univariate Decision Tree and its Equivalent Network
Multivariate decision trees with linearly separable functions at each node can
also be implemented by feedforward networks—in this case threelayer ones. We
show an example in Fig. 6.7 in which the linearly separable functions, each im
plemented by a TLU, are indicated by L
1
, L
2
, L
3
, and L
4
. Again, the ﬁnal layer
has ﬁxed weights, but the weights in the ﬁrst two layers must be trained. Dif
ferent approaches to training procedures have been discussed by [Brent, 1990],
by [John, 1995], and (for a special case) by [Marchand & Golea, 1993].
6.4 Overﬁtting and Evaluation
6.4.1 Overﬁtting
In supervised learning, we must choose a function to ﬁt the training set from
among a set of hypotheses. We have already showed that generalization is
impossible without bias. When we know a priori that the function we are
trying to guess belongs to a small subset of all possible functions, then, even
with an incomplete set of training samples, it is possible to reduce the subset
of functions that are consistent with the training set suﬃciently to make useful
guesses about the value of the function for inputs not in the training set. And,
6.4. OVERFITTING AND EVALUATION 81
L
1
L
2
L
3
L
4
1 0
1
1
0
0
0
1
1
0
0
1
0
X
L
1
L
2
L
3
L
4
conjunctions
L
1
L
2
L
1
L
3
L
4
<
+
+
+
disjunction
<
f
Figure 6.7: A Multivariate Decision Tree and its Equivalent Network
the larger the training set, the more likely it is that even a randomly selected
consistent function will have appropriate outputs for patterns not yet seen.
However, even with bias, if the training set is not suﬃciently large compared
with the size of the hypothesis space, there will still be too many consistent
functions for us to make useful guesses, and generalization performance will be
poor. When there are too many hypotheses that are consistent with the training
set, we say that we are overﬁtting the training data. Overﬁtting is a problem
that we must address for all learning methods.
Since a decision tree of suﬃcient size can implement any Boolean function
there is a danger of overﬁtting—especially if the training set is small. That
is, even if the decision tree is synthesized to classify all the members of the
training set correctly, it might perform poorly on new patterns that were not
used to build the decision tree. Several techniques have been proposed to avoid
overﬁtting, and we shall examine some of them here. They make use of methods
for estimating how well a given decision tree might generalize—methods we shall
describe next.
6.4.2 Validation Methods
The most straightforward way to estimate how well a hypothesized function
(such as a decision tree) performs on a test set is to test it on the test set! But,
if we are comparing several learning systems (for example, if we are comparing
diﬀerent decision trees) so that we can select the one that performs the best on
the test set, then such a comparison amounts to “training on the test data.”
True, training on the test data enlarges the training set, with a consequent ex
pected improvement in generalization, but there is still the danger of overﬁtting
if we are comparing several diﬀerent learning systems. Another technique is to
82 CHAPTER 6. DECISION TREES
split the training set—using (say) twothirds for training and the other third
for estimating generalization performance. But splitting reduces the size of the
training set and thereby increases the possibility of overﬁtting. We next describe
some validation techniques that attempt to avoid these problems.
CrossValidation
In crossvalidation, we divide the training set Ξ into K mutually exclusive and
exhaustive equalsized subsets: Ξ
1
, . . . , Ξ
K
. For each subset, Ξ
i
, train on the
union of all of the other subsets, and empirically determine the error rate, ε
i
,
on Ξ
i
. (The error rate is the number of classiﬁcation errors made on Ξ
i
divided
by the number of patterns in Ξ
i
.) An estimate of the error rate that can be
expected on new patterns of a classiﬁer trained on all the patterns in Ξ is then
the average of the ε
i
.
Leaveoneout Validation
Leaveoneout validation is the same as cross validation for the special case in
which K equals the number of patterns in Ξ, and each Ξ
i
consists of a single
pattern. When testing on each Ξ
i
, we simply note whether or not a mistake
was made. We count the total number of mistakes and divide by K to get
the estimated error rate. This type of validation is, of course, more expensive
computationally, but useful when a more accurate estimate of the error rate for
a classiﬁer is needed. Describe “bootstrapping” also
[Efron, 1982].
6.4.3 Avoiding Overﬁtting in Decision Trees
Near the tips of a decision tree there may be only a few patterns per node.
For these nodes, we are selecting a test based on a very small sample, and thus
we are likely to be overﬁtting. This problem can be dealt with by terminating
the testgenerating procedure before all patterns are perfectly split into their
separate categories. That is, a leaf node may contain patterns of more than one
class, but we can decide in favor of the most numerous class. This procedure
will result in a few errors but often accepting a small number of errors on the
training set results in fewer errors on a testing set.
This behavior is illustrated in Fig. 6.8.
One can use crossvalidation techniques to determine when to stop splitting
nodes. If the cross validation error increases as a consequence of a node split,
then don’t split. One has to be careful about when to stop, though, because
underﬁtting usually leads to more errors on test sets than does overﬁtting. There
is a general rule that the lowest errorrate attainable by a subtree of a fully
expanded tree can be no less than 1/2 of the error rate of the fully expanded
tree [Weiss & Kulikowski, 1991, page 126].
6.4. OVERFITTING AND EVALUATION 83
(From Weiss, S., and Kulikowski, C., Computer Systems that Learn,
Morgan Kaufmann, 1991)
training errors
validation errors
1 2 3 4 5 6 7 8 9
0.2
0.4
0.6
0.8
1.0
0
0
Error Rate
Number of Terminal
Nodes
Iris Data Decision Tree
Figure 6.8: Determining When Overﬁtting Begins
Rather than stopping the growth of a decision tree, one might grow it to
its full size and then prune away leaf nodes and their ancestors until cross
validation accuracy no longer increases. This technique is called postpruning.
Various techniques for pruning are discussed in [Weiss & Kulikowski, 1991].
6.4.4 MinimumDescription Length Methods
An important treegrowing and pruning technique is based on the minimum
descriptionlength (MDL) principle. (MDL is an important idea that extends
beyond decisiontree methods [Rissanen, 1978].) The idea is that the simplest
decision tree that can predict the classes of the training patterns is the best
one. Consider the problem of transmitting just the labels of a training set of
patterns, assuming that the receiver of this information already has the ordered
set of patterns. If there are m patterns, each labeled by one of R classes,
one could transmit a list of m Rvalued numbers. Assuming equally probable
classes, this transmission would require mlog
2
R bits. Or, one could transmit a
decision tree that correctly labelled all of the patterns. The number of bits that
this transmission would require depends on the technique for encoding decision
trees and on the size of the tree. If the tree is small and accurately classiﬁes
all of the patterns, it might be more economical to transmit the tree than to
transmit the labels directly. In between these extremes, we might transmit a
tree plus a list of labels of all the patterns that the tree misclassiﬁes.
In general, the number of bits (or description length of the binary encoded
message) is t + d, where t is the length of the message required to transmit
the tree, and d is the length of the message required to transmit the labels of
84 CHAPTER 6. DECISION TREES
the patterns misclassiﬁed by the tree. In a sense, that tree associated with the
smallest value of t + d is the best or most economical tree. The MDL method
is one way of adhering to the Occam’s razor principle.
Quinlan and Rivest [Quinlan & Rivest, 1989] have proposed techniques for
encoding decision trees and lists of exception labels and for calculating the
description length (t+d) of these trees and labels. They then use the description
length as a measure of quality of a tree in two ways:
a. In growing a tree, they use the reduction in description length to select
tests (instead of reduction in uncertainty).
b. In pruning a tree after it has been grown to zero error, they prune away
those nodes (starting at the tips) that achieve a decrease in the description
length.
These techniques compare favorably with the uncertaintyreduction method,
although they are quite sensitive to the coding schemes used.
6.4.5 Noise in Data
Noise in the data means that one must inevitably accept some number of
errors—depending on the noise level. Refusal to tolerate errors on the training
set when there is noise leads to the problem of “ﬁtting the noise.” Dealing with
noise, then, requires accepting some errors at the leaf nodes just as does the
fact that there are a small number of patterns at leaf nodes.
6.5 The Problem of Replicated Subtrees
Decision trees are not the most economical means of implementing some Boolean
functions. Consider, for example, the function f = x
1
x
2
+x
3
x
4
. A decision tree
for this function is shown in Fig. 6.9. Notice the replicated subtrees shown
circled. The DNFform equivalent to the function implemented by this decision
tree is f = x
1
x
2
+ x
1
x
2
x
3
x
4
+ x
1
x
3
x
4
. This DNF form is nonminimal (in the
number of disjunctions) and is equivalent to f = x
1
x
2
+x
3
x
4
.
The need for replication means that it takes longer to learn the tree and
that subtrees replicated further down the tree must be learned using a smaller
training subset. This problem is sometimes called the fragmentation problem.
Several approaches might be suggested for dealing with fragmenta
tion. One is to attempt to build a decision graph instead of a tree
[Oliver, Dowe, & Wallace, 1992, Kohavi, 1994]. A decision graph that imple
ments the same decisions as that of the decision tree of Fig. 6.9 is shown in Fig.
6.10.
Another approach is to use multivariate (rather than univariate tests at each
node). In our example of learning f = x
1
x
2
+ x
3
x
4
, if we had a test for x
1
x
2
6.6. THE PROBLEM OF MISSING ATTRIBUTES 85
x
1
x
3
x
2
1
0
x
4
0
1
x
3
0
x
4
0
1
Figure 6.9: A Decision Tree with Subtree Replication
and a test for x
3
x
4
, the decision tree could be much simpliﬁed, as shown in Fig.
6.11. Several researchers have proposed techniques for learning decision trees in
which the tests at each node are linearly separable functions. [John, 1995] gives
a nice overview (with several citations) of learning such linear discriminant trees
and presents a method based on “soft entropy.”
A third method for dealing with the replicated subtree problem involves ex
tracting propositional “rules” from the decision tree. The rules will have as an
tecedents the conjunctions that lead down to the leaf nodes, and as consequents
the name of the class at the corresponding leaf node. An example rule from the
tree with the repeating subtree of our example would be: x
1
∧¬x
2
∧x
3
∧x
4
⊃ 1.
Quinlan [Quinlan, 1987] discusses methods for reducing a set of rules to a sim
pler set by 1) eliminating from the antecedent of each rule any “unnecessary”
conjuncts, and then 2) eliminating “unnecessary” rules. A conjunct or rule is
determined to be unnecessary if its elimination has little eﬀect on classiﬁcation
accuracy—as determined by a chisquare test, for example. After a rule set is
processed, it might be the case that more than one rule is “active” for any given
pattern, and care must be taken that the active rules do not conﬂict in their
decision about the class of a pattern.
86 CHAPTER 6. DECISION TREES
x
1
x
3
x
2
1
0
x
4
0
1
Figure 6.10: A Decision Graph
6.6 The Problem of Missing Attributes
To be added.
6.7 Comparisons
Several experimenters have compared decisiontree, neuralnet, and nearest
neighbor classiﬁers on a wide variety of problems. For a comparison of
neural nets versus decision trees, for example, see [Dietterich, et al., 1990,
Shavlik, Mooney, & Towell, 1991, Quinlan, 1994]. In their StatLog project,
[Taylor, Michie, & Spiegalhalter, 1994] give thorough comparisons of several
machine learning algorithms on several diﬀerent types of problems. There seems
x
1
x
2
1
0
x
3
x
4
1
Figure 6.11: A Multivariate Decision Tree
6.8. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 87
to be no single type of classiﬁer that is best for all problems. And, there do
not seem to be any general conclusions that would enable one to say which
classiﬁer method is best for which sorts of classiﬁcation problems, although
[Quinlan, 1994] does provide some intuition about properties of problems that
might render them ill suited for decision trees, on the one hand, or backpropa
gation, on the other.
6.8 Bibliographical and Historical Remarks
To be added.
88 CHAPTER 6. DECISION TREES
Chapter 7
Inductive Logic
Programming
There are many diﬀerent representational forms for functions of input vari
ables. So far, we have seen (Boolean) algebraic expressions, decision trees, and
neural networks, plus other computational mechanisms such as techniques for
computing nearest neighbors. Of course, the representation most important
in computer science is a computer program. For example, a Lisp predicate of
binaryvalued inputs computes a Boolean function of those inputs. Similarly, a
logic program (whose ordinary application is to compute bindings for variables)
can also be used simply to decide whether or not a predicate has value True
(T) or False (F). For example, the Boolean exclusiveor (odd parity) function
of two variables can be computed by the following logic program:
Parity(x,y) : True(x), ¬ True(y)
: True(y), ¬ True(x)
We follow Prolog syntax (see, for example, [Mueller & Page, 1988]), except that
our convention is to write variables as strings beginning with lowercase letters
and predicates as strings beginning with uppercase letters. The unary function
“True” returns T if and only if the value of its argument is T. (We now think
of Boolean functions and arguments as having values of T and F instead of 0
and 1.) Programs will be written in “typewriter” font.
In this chapter, we consider the matter of learning logic programs given
a set of variable values for which the logic program should return T (the
positive instances) and a set of variable values for which it should return
F (the negative instances). The subspecialty of machine learning that deals
with learning logic programs is called inductive logic programming (ILP)
[Lavraˇc & Dˇzeroski, 1994]. As with any learning problem, this one can be quite
complex and intractably diﬃcult unless we constrain it with biases of some sort.
89
90 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
In ILP, there are a variety of possible biases (called language biases). One might
restrict the program to Horn clauses, not allow recursion, not allow functions,
and so on.
As an example of an ILP problem, suppose we are trying to induce a func
tion Nonstop(x,y), that is to have value T for pairs of cities connected by a
nonstop air ﬂight and F for all other pairs of cities. We are given a training set
consisting of positive and negative examples. As positive examples, we might
have (A,B), (A, A1), and some other pairs; as negative examples, we might
have (A1, A2), and some other pairs. In ILP, we usually have additional infor
mation about the examples, called “background knowledge.” In our airﬂight
problem, the background information might be such ground facts as Hub(A),
Hub(B), Satellite(A1,A), plus others. (Hub(A) is intended to mean that the
city denoted by A is a hub city, and Satellite(A1,A) is intended to mean that
the city denoted by A1 is a satellite of the city denoted by A.) From these train
ing facts, we want to induce a program Nonstop(x,y), written in terms of the
background relations Hub and Satellite, that has value T for all the positive
instances and has value F for all the negative instances. Depending on the exact
set of examples, we might induce the program:
Nonstop(x,y) : Hub(x), Hub(y)
: Satellite(x,y)
: Satellite(y,x)
which would have value T if both of the two cities were hub cities or if one were
a satellite of the other. As with other learning problems, we want the induced
program to generalize well; that is, if presented with arguments not represented
in the training set (but for which we have the needed background knowledge),
we would like the function to guess well.
7.1 Notation and Deﬁnitions
In evaluating logic programs in ILP, we implicitly append the background facts
to the program and adopt the usual convention that a program has value T for
a set of inputs if and only if the program interpreter returns T when actually
running the program (with background facts appended) on those inputs; oth
erwise it has value F. Using the given background facts, the program above
would return T for input (A, A1), for example. If a logic program, π, returns
T for a set of arguments X, we say that the program covers the arguments and
write covers(π, X). Following our terminology introduced in connection with
version spaces, we will say that a program is suﬃcient if it covers all of the
positive instances and that it is necessary if it does not cover any of the neg
ative instances. (That is, a program implements a suﬃcient condition that a
training instance is positive if it covers all of the positive training instances; it
7.2. A GENERIC ILP ALGORITHM 91
implements a necessary condition if it covers none of the negative instances.) In
the noiseless case, we want to induce a program that is both suﬃcient and nec
essary, in which case we will call it consistent. With imperfect (noisy) training
sets, we might relax this criterion and settle for a program that covers all but
some fraction of the positive instances while allowing it to cover some fraction
of the negative instances. We illustrate these deﬁnitions schematically in Fig.
7.1.
<
<
<
<
<
<
<
/
1
is a necessary program
/
2
is a sufficient program
/
3
is a consistent program
+
+
+
+
+
+
+
+
+
+
<
<
A positive instance
covered by /
2
and /
3
Figure 7.1: Suﬃcient, Necessary, and Consistent Programs
As in version spaces, if a program is suﬃcient but not necessary it can be
made to cover fewer examples by specializing it. Conversely, if it is necessary
but not suﬃcient, it can be made to cover more examples by generalizing it.
Suppose we are attempting to induce a logic program to compute the relation
ρ. The most general logic program, which is certainly suﬃcient, is the one that
has value T for all inputs, namely a single clause with an empty body, [ρ :
], which is called a fact in Prolog. The most special logic program, which is
certainly necessary, is the one that has value F for all inputs, namely [ρ :
F ]. Two of the many diﬀerent ways to search for a consistent logic program
are: 1) start with [ρ : ] and specialize until the program is consistent, or 2)
start with [ρ : F ] and generalize until the program is consistent. We will
be discussing a method that starts with [ρ : ], specializes until the program
is necessary (but might no longer be suﬃcient), then reachieves suﬃciency in
stages by generalizing—ensuring within each stage that the program remains
necessary (by specializing).
7.2 A Generic ILP Algorithm
Since the primary operators in our search for a consistent program are special
ization and generalization, we must next discuss those operations. There are
92 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
three major ways in which a logic program might be generalized:
a. Replace some terms in a program clause by variables. (Readers familiar
with substitutions in the predicate calculus will note that this process is
the inverse of substitution.)
b. Remove literals from the body of a clause.
c. Add a clause to the program
Analogously, there are three ways in which a logic program might be specialized:
a. Replace some variables in a program clause by terms (a substitution).
b. Add literals to the body of a clause.
c. Remove a clause from the program
We will be presenting an ILP learning method that adds clauses to a program
when generalizing and that adds literals to the body of a clause when special
izing. When we add a clause, we will always add the clause [ρ : ] and then
specialize it by adding literals to the body. Thus, we need only describe the
process for adding literals.
Clauses can be partially ordered by the specialization relation. In general,
clause c
1
is more special than clause c
2
if c
2
= c
1
. A special case, which is what
we use here, is that a clause c
1
is more special than a clause c
2
if the set of
literals in the body of c
2
is a subset of those in c
1
. This ordering relation can
be used in a structure of partially ordered clauses, called the reﬁnement graph,
that is similar to a version space. Clause c
1
is an immediate successor of clause
c
2
in this graph if and only if clause c
1
can be obtained from clause c
2
by adding
a literal to the body of c
2
. A reﬁnement graph then tells us the ways in which
we can specialize a clause by adding a literal to it.
Of course there are unlimited possible literals we might add to the body of
a clause. Practical ILP systems restrict the literals in various ways. Typical
allowed additions are:
a. Literals used in the background knowledge.
b. Literals whose arguments are a subset of those in the head of the clause.
c. Literals that introduce a new distinct variable diﬀerent from those in the
head of the clause.
d. A literal that equates a variable in the head of the clause with another
such variable or with a term mentioned in the background knowledge.
(This possibility is equivalent to forming a specialization by making a
substitution.)
7.2. A GENERIC ILP ALGORITHM 93
e. A literal that is the same (except for its arguments) as that in the head
of the clause. (This possibility admits recursive programs, which are dis
allowed in some systems.)
We can illustrate these possibilities using our airﬂight example. We start
with the program [Nonstop(x,y) : ]. The literals used in the background
knowledge are Hub and Satellite. Thus the literals that we might consider
adding are:
Hub(x)
Hub(y)
Hub(z)
Satellite(x,y)
Satellite(y,x)
Satellite(x,z)
Satellite(z,y)
(x = y)
(If recursive programs are allowed, we could also add the literals Nonstop(x,z)
and Nonstop(z,y).) These possibilities are among those illustrated in the re
ﬁnement graph shown in Fig. 7.2. Whatever restrictions on additional literals
are imposed, they are all syntactic ones from which the successors in the reﬁne
ment graph are easily computed. ILP programs that follow the approach we
are discussing (of specializing clauses by adding a literal) thus have well deﬁned
methods of computing the possible literals to add to a clause.
Now we are ready to write down a simple generic algorithm for inducing a
logic program, π for inducing a relation ρ. We are given a training set, Ξ of
argument sets some known to be in the relation ρ and some not in ρ; Ξ
+
are
the positive instances, and Ξ
−
are the negative instances. The algorithm has
an outer loop in which it successively adds clauses to make π more and more
suﬃcient. It has an inner loop for constructing a clause, c, that is more and
more necessary and in which it refers only to a subset, Ξ
cur
, of the training
instances. (The positive instances in Ξ
cur
will be denoted by Ξ
+
cur
, and the
negative ones by Ξ
−
cur
.) The algorithm is also given background relations and
the means for adding literals to a clause. It uses a logic program interpreter to
compute whether or not the program it is inducing covers training instances.
The algorithm can be written as follows:
Generic ILP Algorithm
(Adapted from [Lavraˇc & Dˇzeroski, 1994, p. 60].)
94 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
Nonstop(x,y) :
Nonstop(x,y) :
Hub(x)
Nonstop(x,y) :
Satellite(x,y)
Nonstop(x,y) :
(x = y)
. . .
. . .
. . . . . .
Nonstop(x,y) : Hub(x), Hub(y)
. . .
. . .
. . .
Figure 7.2: Part of a Reﬁnement Graph
Initialize Ξ
cur
:= Ξ.
Initialize π := empty set of clauses.
repeat [The outer loop works to make π suﬃcient.]
Initialize c := ρ : − .
repeat [The inner loop makes c necessary.]
Select a literal l to add to c. [This is a nondeterministic choice point.]
Assign c := c, l.
until c is necessary. [That is, until c covers no negative instances in Ξ
cur
.]
Assign π := π, c. [We add the clause c to the program.]
Assign Ξ
cur
:= Ξ
cur
− (the positive instances in Ξ
cur
covered by π).
until π is suﬃcient.
(The termination tests for the inner and outer loops can be relaxed as appro
priate for the case of noisy instances.)
7.3 An Example
We illustrate how the algorithm works by returning to our example of airline
ﬂights. Consider the portion of an airline route map, shown in Fig. 7.3. Cities
A, B, and C are “hub” cities, and we know that there are nonstop ﬂights between
all hub cities (even those not shown on this portion of the route map). The other
7.3. AN EXAMPLE 95
cities are “satellites” of one of the hubs, and we know that there are nonstop
ﬂights between each satellite city and its hub. The learning program is given a
set of positive instances, Ξ
+
, of pairs of cities between which there are nonstop
ﬂights and a set of negative instances, Ξ
−
, of pairs of cities between which there
are not nonstop ﬂights. Ξ
+
contains just the pairs:
{< A, B >, < A, C >, < B, C >, < B, A >, < C, A >, < C, B >,
< A, A1 >, < A, A2 >, < A1, A >, < A2, A >, < B, B1 >, < B, B2 >,
< B1, B >, < B2, B >, < C, C1 >, < C, C2 >, < C1, C >, < C2, C >}
For our example, we will assume that Ξ
−
contains all those pairs of cities shown
in Fig. 7.3 that are not in Ξ
+
(a type of closedworld assumption). These are:
{< A, B1 >, < A, B2 >, < A, C1 >, < A, C2 >, < B, C1 >, < B, C2 >,
< B, A1 >, < B, A2 >, < C, A1 >, < C, A2 >, < C, B1 >, < C, B2 >,
< B1, A >, < B2, A >, < C1, A >, < C2, A >, < C1, B >, < C2, B >,
< A1, B >, < A2, B >, < A1, C >, < A2, C >, < B1, C >, < B2, C >}
There may be other cities not shown on this map, so the training set does not
necessarily exhaust all the cities.
A
B
C
C1
C2
B1
B2
A1
A2
Figure 7.3: Part of an Airline Route Map
We want the learning program to induce a program for computing the value
of the relation Nonstop. The training set, Ξ, can be thought of as a partial
96 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
description of this relation in extensional form—it explicitly names some pairs
in the relation and some pairs not in the relation. We desire to learn the
Nonstop relation as a logic program in terms of the background relations, Hub
and Satellite, which are also given in extensional form. Doing so will give us
a more compact, intensional, description of the relation, and this description
could well generalize usefully to other cities not mentioned in the map.
We assume the learning program has the following extensional deﬁnitions of
the relations Hub and Satellite:
Hub
{< A >, < B >, < C >}
All other cities mentioned in the map are assumed not in the relation Hub. We
will use the notation Hub(x) to express that the city named x is in the relation
Hub.
Satellite
{< A1, A, >, < A2, A >, < B1, B >, < B2, B >, < C1, C >, < C2, C >}
All other pairs of cities mentioned in the map are not in the relation Satellite.
We will use the notation Satellite(x,y) to express that the pair < x, y > is
in the relation Satellite.
Knowing that the predicate Nonstop is a twoplace predicate, the inner loop
of our algorithm initializes the ﬁrst clause to Nonstop(x,y) : . This clause
is not necessary because it covers all the negative examples (since it covers all
examples). So we must add a literal to its (empty) body. Suppose (selecting
a literal from the reﬁnement graph) the algorithm adds Hub(x). The following
positive instances in Ξ are covered by Nonstop(x,y) : Hub(x):
{< A, B >, < A, C >, < B, C >, < B, A >, < C, A >, < C, B >,
< A, A1 >, < A, A2 >, < B, B1 >, < B, B2 >, < C, C1 >, < C, C2 >}
To compute this covering, we interpret the logic program Nonstop(x,y) :
Hub(x) for all pairs of cities in Ξ, using the pairs given in the background
relation Hub as ground facts. The following negative instances are also covered:
7.3. AN EXAMPLE 97
{< A, B1 >, < A, B2 >, < A, C1 >, < A, C2 >, < C, A1 >, < C, A2 >,
< C, B1 >, < C, B2 >, < B, A1 >, < B, A2 >, < B, C1 >, < B, C2 >}
Thus, the clause is not yet necessary and another literal must be added. Sup
pose we next add Hub(y). The following positive instances are covered by
Nonstop(x,y) : Hub(x), Hub(y):
{< A, B >, < A, C >, < B, C >, < B, A >, < C, A >, < C, B >}
There are no longer any negative instances in Ξ covered so the clause
Nonstop(x,y) : Hub(x), Hub(y) is necessary, and we can terminate the ﬁrst
pass through the inner loop.
But the program, π, consisting of just this clause is not suﬃcient. These
positive instances are not covered by the clause:
{< A, A1 >, < A, A2 >, < A1, A >, < A2, A >, < B, B1 >, < B, B2 >,
< B1, B >, < B2, B >, < C, C1 >, < C, C2 >, < C1, C >, < C2, C >}
The positive instances that were covered by Nonstop(x,y) : Hub(x), Hub(y)
are removed from Ξ to form the Ξ
cur
to be used in the next pass through the
inner loop. Ξ
cur
consists of all the negative instances in Ξ plus the positive
instances (listed above) that are not yet covered. In order to attempt to cover
them, the inner loop creates another clause c, initially set to Nonstop(x,y)
: . This clause covers all the negative instances, and so we must add liter
als to make it necessary. Suppose we add the literal Satellite(x,y). The
clause Nonstop(x,y) : Satellite(x,y) covers no negative instances, so it is
necessary. It does cover the following positive instances in Ξ
cur
:
{< A1, A >, < A2, A >, < B1, B >, < B2, B >, < C1, C >, < C2, C >}
These instances are removed from Ξ
cur
for the next pass through the inner loop.
The program now contains two clauses:
Nonstop(x,y) : Hub(x), Hub(y)
: Satellite(x,y)
This program is not yet suﬃcient since it does not cover the following positive
instances:
{< A, A1 >, < A, A2 >, < B, B1 >, < B, B2 >, < C, C1 >, < C, C2 >}
98 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
During the next pass through the inner loop, we add the clause Nonstop(x,y)
: Satellite(y,x). This clause is necessary, and since the program containing
all three clauses is now suﬃcient, the procedure terminates with:
Nonstop(x,y) : Hub(x), Hub(y)
: Satellite(x,y)
: Satellite(y,x)
Since each clause is necessary, and the whole program is suﬃcient, the pro
gram is also consistent with all instances of the training set. Note that this
program can be applied (perhaps with good generalization) to other cities be
sides those in our partial map—so long as we can evaluate the relations Hub and
Satellite for these other cities. In the next section, we show how the technique
can be extended to use recursion on the relation we are inducing. With that
extension, the method can be used to induce more general logic programs.
7.4 Inducing Recursive Programs
To induce a recursive program, we allow the addition of a literal having the
same predicate letter as that in the head of the clause. Various mechanisms
must be used to ensure that such a program will terminate; one such is to make
sure that the new literal has diﬀerent variables than those in the head literal.
The process is best illustrated with another example. Our example continues
the one using the airline map, but we make the map somewhat simpler in order
to reduce the size of the extensional relations used. Consider the map shown
in Fig. 7.4. Again, B and C are hub cities, B1 and B2 are satellites of B, C1
and C2 are satellites of C. We have introduced two new cities, B3 and C3. No
ﬂights exist between these cities and any other cities—perhaps there are only
bus routes as shown by the grey lines in the map.
We now seek to learn a program for Canfly(x,y) that covers only those
pairs of cities that can be reached by one or more nonstop ﬂights. The relation
Canfly is satisﬁed by the following pairs of postive instances:
{< B1, B >, < B1, B2 >, < B1, C >, < B1, C1 >, < B1, C2 >,
< B, B1 >, < B2, B1 >, < C, B1 >, < C1, B1 >, < C2, B1 >,
< B2, B >, < B2, C >, < B2, C1 >, < B2, C2 >, < B, B2 >,
< C, B2 >, < C1, B2 >, < C2, B2 >, < B, C >, < B, C1 >,
< B, C2 >, < C, B >, < C1, B >, < C2, B >, < C, C1 >,
< C, C2 >, < C1, C >, < C2, C >, < C1, C2 >, < C2, C1 >}
7.4. INDUCING RECURSIVE PROGRAMS 99
B
C
C1
C2
B1
B2
B3
C3
Figure 7.4: Another Airline Route Map
Using a closedworld assumption on our map, we take the negative instances of
Canfly to be:
{< B3, B2 >, < B3, B >, < B3, B1 >, < B3, C >, < B3, C1 >,
< B3, C2 >, < B3, C3 >, < B2, B3 >, < B, B3 >, < B1, B3 >,
< C, B3 >, < C1, B3 >, < C2, B3 >, < C3, B3 >, < C3, B2 >,
< C3, B >, < C3, B1 >, < C3, C >, < C3, C1 >, < C3, C2 >,
< B2, C3 >, < B, C3 >, < B1, C3 >, < C, C3 >, < C1, C3 >,
< C2, C3 >}
We will induce Canfly(x,y) using the extensionally deﬁned background
relation Nonstop given earlier (modiﬁed as required for our reduced airline map)
and Canfly itself (recursively).
As before, we start with the empty program and proceed to the inner loop
to construct a clause that is necessary. Suppose that the inner loop adds the
background literal Nonstop(x,y). The clause Canfly(x,y) : Nonstop(x,y)
is necessary; it covers no negative instances. But it is not suﬃcient because it
does not cover the following positive instances:
{< B1, B2 >, < B1, C >, < B1, C1 >, < B1, C2 >, < B2, B1 >,
< C, B1 >, < C1, B1 >, < C2, B1 >, < B2, C >, < B2, C1 >,
< B2, C2 >, < C, B2 >, < C1, B2 >, < C2, B2 >, < B, C1 >,
100 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
< B, C2 >, < C1, B >, < C2, B >, < C1, C2 >, < C2, C1 >}
Thus, we must add another clause to the program. In the inner loop, we ﬁrst
create the clause Canfly(x,y) : Nonstop(x,z) which introduces the new
variable z. We digress brieﬂy to describe how a program containing a clause
with unbound variables in its body is interpreted. Suppose we try to inter
pret it for the positive instance Canfly(B1,B2). The interpreter attempts to
establish Nonstop(B1,z) for some z. Since Nonstop(B1, B), for example, is
a background fact, the interpreter returns T—which means that the instance
< B1, B2 > is covered. Suppose now, we attempt to interpret the clause
for the negative instance Canfly(B3,B). The interpreter attempts to estab
lish Nonstop(B3,z) for some z. There are no background facts that match, so
the clause does not cover < B3, B >. Using the interpreter, we see that the
clause Canfly(x,y) : Nonstop(x,z) covers all of the positive instances not
already covered by the ﬁrst clause, but it also covers many negative instances
such as < B2, B3 >, and < B, B3 >. So the inner loop must add another literal.
This time, suppose it adds Canfly(z,y) to yield the clause Canfly(x,y) :
Nonstop(x,z), Canfly(z,y). This clause is necessary; no negative instances
are covered. The program is now suﬃcient and consistent; it is:
Canfly(x,y) : Nonstop(x,y)
: Nonstop(x,z), Canfly(z,y)
7.5 Choosing Literals to Add
One of the ﬁrst practical ILP systems was Quinlan’s FOIL [Quinlan, 1990]. A
major problem involves deciding how to select a literal to add in the inner loop
(from among the literals that are allowed). In FOIL, Quinlan suggested that
candidate literals can be compared using an informationlike measure—similar
to the measures used in inducing decision trees. A measure that gives the same
comparison as does Quinlan’s is based on the amount by which adding a literal
increases the odds that an instance drawn at random from those covered by the
new clause is a positive instance beyond what these odds were before adding
the literal.
Let p be an estimate of the probability that an instance drawn at random
from those covered by a clause before adding the literal is a positive instance.
That is, p =(number of positive instances covered by the clause)/(total number
of instances covered by the clause). It is convenient to express this probability
in “odds form.” The odds, o, that a covered instance is positive is deﬁned to
be o = p/(1 − p). Expressing the probability in terms of the odds, we obtain
p = o/(1 +o).
7.6. RELATIONSHIPS BETWEENILP ANDDECISIONTREE INDUCTION101
After selecting a literal, l, to add to a clause, some of the instances previously
covered are still covered; some of these are positive and some are negative. Let
p
l
denote the probability that an instance drawn at random from the instances
covered by the new clause (with l added) is positive. The odds will be denoted
by o
l
. We want to select a literal, l, that gives maximal increase in these
odds. That is, if we deﬁne λ
l
= o
l
/o, we want a literal that gives a high
value of λ
l
. Specializing the clause in such a way that it fails to cover many of
the negative instances previously covered but still covers most of the positive
instances previously covered will result in a high value of λ
l
. (It turns out that
the value of Quinlan’s information theoretic measure increases monotonically
with λ
l
, so we could just as well use the latter instead.)
Besides ﬁnding a literal with a high value of λ
l
, Quinlan’s FOIL system also
restricts the choice to literals that:
a) contain at least one variable that has already been used,
b) place further restrictions on the variables if the literal selected has the
same predicate letter as the literal being induced (in order to prevent inﬁnite
recursion), and
c) survive a pruning test based on the values of λ
l
for those literals selected
so far.
We refer the reader to Quinlan’s paper for further discussion of these points.
Quinlan also discusses postprocessing pruning methods and presents experi
mental results of the method applied to learning recursive relations on lists, on
learning rules for chess endgames and for the card game Eleusis, and for some
other standard tasks mentioned in the machine learning literature.
The reader should also refer to [Pazzani & Kibler, 1992,
Lavraˇc & Dˇzeroski, 1994, Muggleton, 1991, Muggleton, 1992]. Discuss preprocessing,
postprocessing, bottomup
methods, and LINUS.
7.6 Relationships Between ILP and Decision
Tree Induction
The generic ILP algorithm can also be understood as a type of decision tree
induction. Recall the problem of inducing decision trees when the values of
attributes are categorical. When splitting on a single variable, the split at
each node involves asking to which of several mutually exclusive and exhaustive
subsets the value of a variable belongs. For example, if a node tested the variable
x
i
, and if x
i
could have values drawn from {A, B, C, D, E, F}, then one possible
split (among many) might be according to whether the value of x
i
had as value
one of {A, B, C} or one of {D, E, F}.
It is also possible to make a multivariate split—testing the values of two or
more variables at a time. With categorical variables, an nvariable split would
be based on which of several nary relations the values of the variables satisﬁed.
For example, if a node tested the variables x
i
and x
j
, and if x
i
and x
j
both
could have values drawn from {A, B, C, D, E, F}, then one possible binary split
102 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
(among many) might be according to whether or not < x
i
, x
j
> satisﬁed the
relation {< A, C >, < C, D >}. (Note that our subset method of forming single
variable splits could equivalently have been framed using 1ary relations—which
are usually called properties.)
In this framework, the ILP problem is as follows: We are given a training set,
Ξ, of positively and negatively labeled patterns whose components are drawn
from a set of variables {x, y, z, . . .}. The positively labeled patterns in Ξ form an
extensional deﬁnition of a relation, R. We are also given background relations,
R
1
, . . . , R
k
, on various subsets of these variables. (That is, we are given sets
of tuples that are in these relations.) We desire to construct an intensional
deﬁnition of R in terms of the R
1
, . . . , R
k
, such that all of the positively labeled
patterns in Ξ are satisﬁed by R and none of the negatively labeled patterns
are. The intensional deﬁnition will be in terms of a logic program in which the
relation R is the head of a set of clauses whose bodies involve the background
relations.
The generic ILP algorithm can be understood as decision tree induction,
where each node of the decision tree is itself a subdecision tree, and each sub
decision tree consists of nodes that make binary splits on several variables using
the background relations, R
i
. Thus we will speak of a toplevel decision tree
and various subdecision trees. (Actually, our decision trees will be decision
lists—a special case of decision trees, but we will refer to them as trees in our
discussions.)
In broad outline, the method for inducing an intensional version of the rela
tion R is illustrated by considering the decision tree shown in Fig. 7.5. In this
diagram, the patterns in Ξ are ﬁrst ﬁltered through the decision tree in top
level node 1. The background relation R
1
is satisﬁed by some of these patterns;
these are ﬁltered to the right (to relation R
2
), and the rest are ﬁltered to the
left (more on what happens to these later). Rightgoing patterns are ﬁltered
through a sequence of relational tests until only positively labeled patterns sat
isfy the last relation—in this case R
3
. That is, the subset of patterns satisfying
all the relations, R
1
, R
2
, and R
3
contains only positive instances from Ξ. (We
might say that this combination of tests is necessary. They correspond to the
clause created in the ﬁrst pass through the inner loop of the generic ILP algo
rithm.) Let us call the subset of patterns satisfying these relations, Ξ
1
; these
satisfy Node 1 at the top level. All other patterns, that is {Ξ − Ξ
1
} = Ξ
2
are
ﬁltered to the left by Node 1.
Ξ
2
is then ﬁltered by toplevel Node 2 in much the same manner, so that
Node 2 is satisﬁed only by the positively labeled samples in Ξ
2
. We continue
ﬁltering through toplevel nodes until only the negatively labeled patterns fail to
satisfy a top node. In our example, Ξ
4
contains only negatively labeled patterns
and the union of Ξ
1
and Ξ
3
contains all the positively labeled patterns. The
relation, R, that distinguishes positive from negative patterns in Ξ is then given
in terms of the following logic program:
R : R1, R2, R3
7.6. RELATIONSHIPS BETWEENILP ANDDECISIONTREE INDUCTION103
R
1
R
2
R
3
T
T
T
F
F
F
T
F
R
4
R
5
T
T
F
F
T
F
U
U
1
U
2
= U < U
1
U
3
U
4
= U
2
< U
3
Node 1
Node 2
(only positive
instances
satisfy all three
tests)
(only positivel
instances satisfy
these two tests)
(only negative
instances)
Figure 7.5: A Decision Tree for ILP
: R4, R5
If we apply this sort of decisiontree induction procedure to the problem
of generating a logic program for the relation Nonstop (refer to Fig. 7.3), we
obtain the decision tree shown in Fig. 7.6. The logic program resulting from
this decision tree is the same as that produced by the generic ILP algorithm.
In setting up the problem, the training set, Ξ can be expressed as a set of 2
dimensional vectors with components x and y. The values of these components
range over the cities {A, B, C, A1, A2, B1, B2, C1, C2} except (for simplicity)
we do not allow patterns in which x and y have the same value. As before, the
relation, Nonstop, contains the following pairs of cities, which are the positive
instances:
{< A, B >, < A, C >, < B, C >, < B, A >, < C, A >, < C, B >,
< A, A1 >, < A, A2 >, < A1, A >, < A2, A >, < B, B1 >, < B, B2 >,
< B1, B >, < B2, B >, < C, C1 >, < C, C2 >, < C1, C >, < C2, C >}
All other pairs of cities named in the map of Fig. 7.3 (using the closed world
assumption) are not in the relation Nonstop and thus are negative instances.
Because the values of x and y are categorical, decisiontree induction would
be a very diﬃcult task—involving as it does the need to invent relations on
104 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
x and y to be used as tests. But with the background relations, R
i
(in this
case Hub and Satellite), the problem is made much easier. We select these
relations in the same way that we select literals; from among the available tests,
we make a selection based on which leads to the largest value of λ
Ri
.
7.7 Bibliographical and Historical Remarks
To be added.
7.7. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 105
Hub(x)
T
F
U
Node 1
(top level)
{<A,B>, <A,C>,
<B,C>, <B,A>,
<C,A>, <C,B>}
Hub(y)
T
T
F Node 2
(top level)
Satellite(x,y)
F
T
T
{<A1,A>, <A2,A>, <B1,B>,
<B2,B>, <C1,C>, <C2,C>}
F
{<A,A1>, <A,A2>,<B,B1>,
<B,B2>, <C,C1>, <C,C2>}
Satellite(y,x)
F
F
T
Node 3
(top level)
T
{Only negative instances}
(Only positive instances)
(Only positive instances)
(Only positive instances)
F
Figure 7.6: A Decision Tree for the Airline Route Problem
106 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING
Chapter 8
Computational Learning
Theory
In chapter one we posed the problem of guessing a function given a set of
sample inputs and their values. We gave some intuitive arguments to support
the claim that after seeing only a small fraction of the possible inputs (and
their values) that we could guess almost correctly the values of most subsequent
inputs—if we knew that the function we were trying to guess belonged to an
appropriately restricted subset of functions. That is, a given training set of
sample patterns might be adequate to allow us to select a function, consistent
with the labeled samples, from among a restricted set of hypotheses such that
with high probability the function we select will be approximately correct (small
probability of error) on subsequent samples drawn at random according to the
same distribution from which the labeled samples were drawn. This insight
led to the theory of probably approximately correct (PAC) learning—initially
developed by Leslie Valiant [Valiant, 1984]. We present here a brief description
of the theory for the case of Boolean functions. [Dietterich, 1990, Haussler, 1988,
Haussler, 1990] give nice surveys of the important results. Other overviews?
8.1 Notation and Assumptions for PAC Learn
ing Theory
We assume a training set Ξ of ndimensional vectors, X
i
, i = 1, . . . , m, each
labeled (by 1 or 0) according to a target function, f, which is unknown to
the learner. The probability of any given vector X being in Ξ, or later being
presented to the learner, is P(X). The probability distribution, P, can be
arbitrary. (In the literature of PAC learning theory, the target function is usually
called the target concept and is denoted by c, but to be consistent with our
previous notation we will continue to denote it by f.) Our problem is to guess
107
108 CHAPTER 8. COMPUTATIONAL LEARNING THEORY
a function, h(X), based on the labeled samples in Ξ. In PAC theory such a
guessed function is called the hypothesis. We assume that the target function
is some element of a set of functions, C. We also assume that the hypothesis,
h, is an element of a set, H, of hypotheses, which includes the set, C, of target
functions. H is called the hypothesis space.
In general, h won’t be identical to f, but we can strive to have the value of
h(X) = the value of f(X) for most X’s. That is, we want h to be approximately
correct. To quantify this notion, we deﬁne the error of h, ε
h
, as the probability
that an X drawn randomly according to P will be misclassiﬁed:
ε
h
=
¸
[X:h(X)=f(X)]
P(X)
Boldface symbols need to be
smaller when they are subscripts in
math environments.
We say that h is approximately (except for ε ) correct if ε
h
≤ ε, where ε is the
accuracy parameter.
Suppose we are able to ﬁnd an h that classiﬁes all m randomly drawn training
samples correctly; that is, h is consistent with this randomly selected training
set, Ξ. If m is large enough, will such an h be approximately correct (and
for what value of ε)? On some training occasions, using m randomly drawn
training samples, such an h might turn out to be approximately correct (for a
given value of ε), and on others it might not. We say that h is probably (except
for δ) approximately correct (PAC) if the probability that it is approximately
correct is greater than 1−δ, where δ is the conﬁdence parameter. We shall show
that if m is greater than some bound whose value depends on ε and δ, such an
h is guaranteed to be probably approximately correct.
In general, we say that a learning algorithm PAClearns functions from C in
terms of H iﬀ for every function f C, it outputs a hypothesis h H, such that
with probability at least (1 − δ), ε
h
≤ ε. Such a hypothesis is called probably
(except for δ) approximately (except for ε) correct.
We want learning algorithms that are tractable, so we want an algorithm
that PAClearns functions in polynomial time. This can only be done for certain
classes of functions. If there are a ﬁnite number of hypotheses in a hypothesis
set (as there are for many of the hypothesis sets we have considered), we could
always produce a consistent hypothesis from this set by testing all of them
against the training data. But if there are an exponential number of hypotheses,
that would take exponential time. We seek training methods that produce
consistent hypotheses in less time. The time complexities for various hypothesis
sets have been determined, and these are summarized in a table to be presented
later.
A class, C, is polynomially PAC learnable in terms of H provided there exists
a polynomialtime learning algorithm (polynomial in the number of samples
needed, m, in the dimension, n, in 1/ε, and in 1/δ) that PAClearns functions
in C in terms of H.
Initial work on PAC assumed H = C, but it was later shown that some func
tions cannot be polynomially PAClearned under such an assumption (assuming
8.2. PAC LEARNING 109
P = NP)—but can be polynomially PAClearned if H is a strict superset of C!
Also our deﬁnition does not specify the distribution, P, from which patterns
are drawn nor does it say anything about the properties of the learning algo
rithm. Since C and H do not have to be identical, we have the further restrictive
deﬁnition:
A properly PAClearnable class is a class C for which there exists an algorithm
that polynomially PAClearns functions from C in terms of C.
8.2 PAC Learning
8.2.1 The Fundamental Theorem
Suppose our learning algorithm selects some h randomly from among those that
are consistent with the values of f on the m training patterns. The probability
that the error of this randomly selected h is greater than some ε, with h consis
tent with the values of f(X) for m instances of X (drawn according to arbitrary
P), is less than or equal to He
−εm
, where H is the number of hypotheses in
H. We state this result as a theorem [Blumer, et al., 1987]:
Theorem 8.1 (Blumer, et al.) Let H be any set of hypotheses, Ξ be a set of
m ≥ 1 training examples drawn independently according to some distribution
P, f be any classiﬁcation function in H, and ε > 0. Then, the probability that
there exists a hypothesis h consistent with f for the members of Ξ but with error
greater than ε is at most He
−εm
.
Proof:
Consider the set of all hypotheses, {h
1
, h
2
, . . . , h
i
, . . . , h
S
}, in H, where S =
H. The error for h
i
is ε
hi
= the probability that h
i
will classify a pattern in
error (that is, diﬀerently than f would classify it). The probability that h
i
will
classify a pattern correctly is (1−ε
hi
). A subset, H
B
, of Hwill have error greater
than ε. We will call the hypotheses in this subset bad. The probability that any
particular one of these bad hypotheses, say h
b
, would classify a pattern correctly
is (1−ε
h
b
). Since ε
h
b
> ε, the probability that h
b
(or any other bad hypothesis)
would classify a pattern correctly is less than (1 − ε). The probability that it
would classify all m independently drawn patterns correctly is then less than
(1 −ε)
m
.
That is,
prob[h
b
classiﬁes all m patterns correctly h
b
H
B
] ≤ (1 −ε)
m
.
prob[some h H
B
classiﬁes all m patterns correctly]
=
¸
h
b
H
B
prob[h
b
classiﬁes all m patterns correctly h
b
H
B
]
≤ K(1 −ε)
m
, where K = H
B
.
110 CHAPTER 8. COMPUTATIONAL LEARNING THEORY
That is,
prob[there is a bad hypothesis that classiﬁes all m patterns correctly]
≤ K(1 −ε)
m
.
Since K ≤ H and (1 −ε)
m
≤ e
−εm
, we have:
prob[there is a bad hypothesis that classiﬁes all m patterns correctly]
= prob[there is a hypothesis with error > ε and that classiﬁes all m patterns
correctly] ≤ He
−εm
.
QED
A corollary of this theorem is:
Corollary 8.2 Given m ≥ (1/ε)(ln H + ln(1/δ)) independent samples, the
probability that there exists a hypothesis in H that is consistent with f on these
samples and has error greater than ε is at most δ.
Proof: We are to ﬁnd a bound on m that guarantees that
prob[there is a hypothesis with error > ε and that classiﬁes all m patterns
correctly] ≤ δ. Thus, using the result of the theorem, we must show that
He
−εm
≤ δ. Taking the natural logarithm of both sides yields:
ln H −εm ≤ ln δ
or
m ≥ (1/ε)(ln H + ln(1/δ))
QED
This corollary is important for two reasons. First it clearly states that we
can select any hypothesis consistent with the m samples and be assured that
with probability (1 − δ) its error will be less than ε. Also, it shows that in
order for m to increase no more than polynomially with n, H can be no larger
than 2
O(n
k
)
. No class larger than that can be guaranteed to be properly PAC
learnable.
Here is a possible point of confusion: The bound given in the corollary is
an upper bound on the value of m needed to guarantee polynomial probably ap
proximately correct learning. Values of m greater than that bound are suﬃcient
(but might not be necessary). We will present a lower (necessary) bound later
in the chapter.
8.2. PAC LEARNING 111
8.2.2 Examples
Terms
Let H be the set of terms (conjunctions of literals). Then, H = 3
n
, and
m ≥ (1/ε)(ln(3
n
) + ln(1/δ))
≥ (1/ε)(1.1n + ln(1/δ))
Note that the bound on m increases only polynomially with n, 1/ε, and 1/δ.
For n = 50, ε = 0.01 and δ = 0.01, m ≥ 5, 961 guarantees PAC learnability.
In order to show that terms are properly PAC learnable, we additionally
have to show that one can ﬁnd in time polynomial in m and n a hypothesis
h consistent with a set of m patterns labeled by the value of a term. The
following procedure for ﬁnding such a consistent hypothesis requires O(nm)
steps (adapted from [Dietterich, 1990, page 268]):
We are given a training sequence, Ξ, of m examples. Find the ﬁrst pattern,
say X
1
, in that list that is labeled with a 1. Initialize a Boolean function,
h, to the conjunction of the n literals corresponding to the values of the n
components of X
1
. (Components with value 1 will have corresponding positive
literals; components with value 0 will have corresponding negative literals.) If
there are no patterns labeled by a 1, we exit with the null concept (h ≡ 0 for
all patterns). Then, for each additional pattern, X
i
, that is labeled with a 1,
we delete from h any Boolean variables appearing in X
i
with a sign diﬀerent
from their sign in h. After processing all the patterns labeled with a 1, we check
all of the patterns labeled with a 0 to make sure that none of them is assigned
value 1 by h. If, at any stage of the algorithm, any patterns labeled with a 0
are assigned a 1 by h, then there exists no term that consistently classiﬁes the
patterns in Ξ, and we exit with failure. Otherwise, we exit with h. Change this paragraph if this
algorithm was presented in Chapter
Three. As an example, consider the following patterns, all labeled with a 1 (from
[Dietterich, 1990]):
(0, 1, 1, 0)
(1, 1, 1, 0)
(1, 1, 0, 0)
After processing the ﬁrst pattern, we have h = x
1
x
2
x
3
x
4
; after processing the
second pattern, we have h = x
2
x
3
x
4
; ﬁnally, after the third pattern, we have
h = x
2
x
4
.
Linearly Separable Functions
Let H be the set of all linearly separable functions. Then, H ≤ 2
n
2
, and
112 CHAPTER 8. COMPUTATIONAL LEARNING THEORY
m ≥ (1/ε)
n
2
ln 2 + ln(1/δ)
Again, note that the bound on m increases only polynomially with n, 1/ε, and
1/δ.
For n = 50, ε = 0.01 and δ = 0.01, m ≥ 173, 748 guarantees PAC learnabil
ity.
To show that linearly separable functions are properly PAC learnable, we
would have additionally to show that one can ﬁnd in time polynomial in m and
n a hypothesis h consistent with a set of m labeled linearly separable patterns. Linear programming is polynomial.
8.2.3 Some Properly PACLearnable Classes
Some properly PAClearnable classes of functions are given in the following
table. (Adapted from [Dietterich, 1990, pages 262 and 268] which also gives
references to proofs of some of the time complexities.)
H H Time Complexity P. Learnable?
terms 3
n
polynomial yes
kterm DNF 2
O(kn)
NPhard no
(k disjunctive terms)
kDNF 2
O(n
k
)
polynomial yes
(a disjunction of ksized terms)
kCNF 2
O(n
k
)
polynomial yes
(a conjunction of ksized clauses)
kDL 2
O(n
k
k lg n)
polynomial yes
(decision lists with ksized terms)
lin. sep. 2
O(n
2
)
polynomial yes
lin. sep. with (0,1) weights ? NPhard no
k2NN ? NPhard no
DNF 2
2
n
polynomial no
(all Boolean functions)
(Members of the class k2NN are twolayer, feedforward neural networks with
exactly k hidden units and one output unit.)
Summary: In order to show that a class of functions is Properly PAC
Learnable :
a. Show that there is an algorithm that produces a consistent hypothesis on
m ndimensional samples in time polynomial in m and n.
b. Show that the sample size, m, needed to ensure PAC learnability is polyno
mial (or better) in (1/ε), (1/δ), and n by showing that ln H is polynomial
or better in the number of dimensions.
8.3. THE VAPNIKCHERVONENKIS DIMENSION 113
As hinted earlier, sometimes enlarging the class of hypotheses makes learning
easier. For example, the table above shows that kCNF is PAC learnable, but
ktermDNF is not. And yet, ktermDNF is a subclass of kCNF! So, even if
the target function were in ktermDNF, one would be able to ﬁnd a hypothesis
in kCNF that is probably approximately correct for the target function. Sim
ilarly, linearly separable functions implemented by TLUs whose weight values
are restricted to 0 and 1 are not properly PAC learnable, whereas unrestricted
linearly separable functions are. It is possible that enlarging the space of hy
potheses makes ﬁnding one that is consistent with the training examples easier.
An interesting question is whether or not the class of functions in k2NN is poly
nomially PAC learnable if the hypotheses are drawn from k
2NN with k
> k.
(At the time of writing, this matter is still undecided.)
Although PAC learning theory is a powerful analytic tool, it (like complexity
theory) deals mainly with worstcase results. The fact that the class of two
layer, feedforward neural networks is not polynomially PAC learnable is more an
attack on the theory than it is on the networks, which have had many successful
applications. As [Baum, 1994, page 41617] says: “ . . . humans are capable of
learning in the natural world. Therefore, a proof within some model of learning
that learning is not feasible is an indictment of the model. We should examine
the model to see what constraints can be relaxed and made more realistic.”
8.3 The VapnikChervonenkis Dimension
8.3.1 Linear Dichotomies
Consider a set, H, of functions, and a set, Ξ, of (unlabeled) patterns. One
measure of the expressive power of a set of hypotheses, relative to Ξ, is its
ability to make arbitrary classiﬁcations of the patterns in Ξ.
1
If there are m
patterns in Ξ, there are 2
m
diﬀerent ways to divide these patterns into two
disjoint and exhaustive subsets. We say there are 2
m
diﬀerent dichotomies of
Ξ. If Ξ were to include all of the 2
n
Boolean patterns, for example, there are
2
2
n
ways to dichotomize them, and (of course) the set of all possible Boolean
functions dichotomizes them in all of these ways. But a subset, H, of the Boolean
functions might not be able to dichotomize an arbitrary set, Ξ, of m Boolean
patterns in all 2
m
ways. In general (that is, even in the nonBoolean case), we
say that if a subset, H, of functions can dichotomize a set, Ξ, of m patterns in
all 2
m
ways, then H shatters Ξ.
As an example, consider a set Ξ of m patterns in the ndimensional space,
R
n
. (That is, the n components of these patterns are real numbers.) We deﬁne
a linear dichotomy as one implemented by an (n−1)dimensional hyperplane in
the ndimensional space. How many linear dichotomies of m patterns in n di
mensions are there? For example, as shown in Fig. 8.1, there are 14 dichotomies
1
And, of course, if a hypothesis drawn from a set that could make arbitrary classiﬁcations
of a set of training patterns, there is little likelihood that such a hypothesis will generalize
well beyond the training set.
114 CHAPTER 8. COMPUTATIONAL LEARNING THEORY
of four points in two dimensions (each separating line yields two dichotomies
depending on whether the points on one side of the line are classiﬁed as 1 or 0).
(Note that even though there are an inﬁnite number of hyperplanes, there are,
nevertheless, only a ﬁnite number of ways in which hyperplanes can dichotomize
a ﬁnite number of patterns. Small movements of a hyperplane typically do not
change the classiﬁcations of any patterns.)
1 2
3
4
14 dichotomies of 4 points in 2 dimensions
5
6
7
Figure 8.1: Dichotomizing Points in Two Dimensions
The number of dichotomies achievable by hyperplanes depends on how the
patterns are disposed. For the maximum number of linear dichotomies, the
points must be in what is called general position. For m > n, we say that a set
of m points is in general position in an ndimensional space if and only if no
subset of (n+1) points lies on an (n−1)dimensional hyperplane. When m ≤ n,
a set of m points is in general position if no (m − 2)dimensional hyperplane
contains the set. Thus, for example, a set of m ≥ 4 points is in general position
in a threedimensional space if no four of them lie on a (twodimensional) plane.
We will denote the number of linear dichotomies of m points in general position
in an ndimensional space by the expression Π
L
(m, n).
It is not too diﬃcult to verify that: Include the derivation.
Π
L
(m, n) = 2
n
¸
i=0
C(m−1, i) for m > n, and
= 2
m
for m ≤ n
8.3. THE VAPNIKCHERVONENKIS DIMENSION 115
where C(m−1, i) is the binomial coeﬃcient
(m−1)!
(m−1−i)!i!
.
The table below shows some values for Π
L
(m, n).
m n
(no. of patterns) (dimension)
1 2 3 4 5
1 2 2 2 2 2
2 4 4 4 4 4
3 6 8 8 8 8
4 8 14 16 16 16
5 10 22 30 32 32
6 12 32 52 62 64
7 14 44 84 114 126
8 16 58 128 198 240
Note that the class of linear dichotomies shatters the m patterns if m ≤ n + 1.
The boldface entries in the table correspond to the highest values of m for
which linear dichotomies shatter m patterns in n dimensions.
8.3.2 Capacity
Let P
m,n
=
Π
L
(m,n)
2
m
= the probability that a randomly selected dichotomy (out
of the 2
m
possible dichotomies of m patterns in n dimensions) will be linearly
separable. In Fig. 8.2 we plot P
λ(n+1),n
versus λ and n, where λ = m/(n + 1).
Note that for large n (say n > 30) how quickly P
m,n
falls from 1 to 0 as
m goes above 2(n + 1). For m < 2(n + 1), any dichotomy of the m points is
almost certainly linearly separable. But for m > 2(n + 1), a randomly selected
dichotomy of the m points is almost certainly not linearly separable. For this
reason m = 2(n + 1) is called the capacity of a TLU [Cover, 1965]. Unless the
number of training patterns exceeds the capacity, the fact that a TLU separates
those training patterns according to their labels means nothing in terms of how
well that TLU will generalize to new patterns. There is nothing special about
a separation found for m < 2(n + 1) patterns—almost any dichotomy of those
patterns would have been linearly separable. To make sure that the separation
found is forced by the training set and thus generalizes well, it has to be the
case that there are very few linearly separable functions that would separate
the m training patterns.
Analogous results about the generalizing abilities of neural networks have
been developed by [Baum & Haussler, 1989] and given intuitive and experimen
tal justiﬁcation in [Baum, 1994, page 438]:
“The results seemed to indicate the following heuristic rule holds. If
M examples [can be correctly classiﬁed by] a net with W weights (for
M >> W), the net will make a fraction ε of errors on new examples
chosen from the same [uniform] distribution where ε = W/M.”
116 CHAPTER 8. COMPUTATIONAL LEARNING THEORY
0
1
2
3
4
10
20
30
40
50
0
0.25
0.5
0.75
1
0
1
2
3
4
10
20
30
40
50
0
25
.5
75
1
P
h(n + 1), n
h
n
Figure 8.2: Probability that a Random Dichotomy is Linearly Separable
8.3.3 A More General Capacity Result
Corollary 7.2 gave us an expression for the number of training patterns suﬃcient
to guarantee a required level of generalization—assuming that the function we
were guessing was a function belonging to a class of known and ﬁnite cardinality.
The capacity result just presented applies to linearly separable functions for non
binary patterns. We can extend these ideas to general dichotomies of nonbinary
patterns.
In general, let us denote the maximum number of dichotomies of any set
of m ndimensional patterns by hypotheses in H as Π
H
(m, n). The number of
dichotomies will, of course, depend on the disposition of the m points in the
ndimensional space; we take Π
H
(m, n) to be the maximum over all possible
arrangements of the m points. (In the case of the class of linearly separable
functions, the maximum number is achieved when the m points are in general
position.) For each class, H, there will be some maximum value of m for which
Π
H
(m, n) = 2
m
, that is, for which H shatters the m patterns. This maximum
number is called the VapnikChervonenkis (VC) dimension and is denoted by
VCdim(H) [Vapnik & Chervonenkis, 1971].
We saw that for the class of linear dichotomies, VCdim(Linear) = (n + 1).
As another example, let us calculate the VC dimension of the hypothesis space
of single intervals on the real line—used to classify points on the real line. We
show an example of how points on the line might be dichotomized by a single
interval in Fig. 8.3. The set Ξ could be, for example, {0.5, 2.5,  2.3, 3.14}, and
one of the hypotheses in our set would be [1, 4.5]. This hypothesis would label
the points 2.5 and 3.14 with a 1 and the points  2.3 and 0.5 with a 0. This
8.3. THE VAPNIKCHERVONENKIS DIMENSION 117
set of hypotheses (single intervals on the real line) can arbitrarily classify any
two points. But no single interval can classify three points such that the outer
two are classiﬁed as 1 and the inner one as 0. Therefore the VC dimension of
single intervals on the real line is 2. As soon as we have many more than 2
training patterns on the real line and provided we know that the classiﬁcation
function we are trying to guess is a single interval, then we begin to have good
generalization.
Figure 8.3: Dichotomizing Points by an Interval
The VC dimension is a useful measure of the expressive power of a hypothesis
set. Since any dichotomy of VCdim(H) or fewer patterns in general position in n
dimensions can be achieved by some hypothesis in H, we must have many more
than VCdim(H) patterns in the training set in order that a hypothesis consistent
with the training set is suﬃciently constrained to imply good generalization.
Our examples have shown that the concept of VC dimension is not restricted
to Boolean functions.
8.3.4 Some Facts and Speculations About the VC Dimen
sion
• If there are a ﬁnite number, H, of hypotheses in H, then:
VCdim(H) ≤ log(H)
• The VC dimension of terms in n dimensions is n.
• Suppose we generalize our example that used a hypothesis set of single
intervals on the real line. Now let us consider an ndimensional feature
space and tests of the form L
i
≤ x
i
≤ H
i
. We allow only one such test per
dimension. A hypothesis space consisting of conjunctions of these tests
(called axisparallel hyperrectangles) has VC dimension bounded by:
n ≤ VCdim ≤ 2n
• As we have already seen, TLUs with n inputs have a VC dimension of
n + 1.
• [Baum, 1994, page 438] gives experimental evidence for the proposition
that “ . . . multilayer [neural] nets have a VC dimension roughly equal to
their total number of [adjustable] weights.”
118 CHAPTER 8. COMPUTATIONAL LEARNING THEORY
8.4 VC Dimension and PAC Learning
There are two theorems that connect the idea of VC dimension with PAC learn
ing [Blumer, et al., 1990]. We state these here without proof.
Theorem 8.3 (Blumer, et al.) A hypothesis space H is PAC learnable iﬀ it
has ﬁnite VC dimension.
Theorem 8.4 A set of hypotheses, H, is properly PAC learnable if:
a. m ≥ (1/ε) max [4 lg(2/δ), 8 VCdim lg(13/ε)], and
b. if there is an algorithm that outputs a hypothesis h H consistent with the
training set in polynomial (in m and n) time.
The second of these two theorems improves the bound on the number of
training patterns needed for linearly separable functions to one that is linear
in n. In our previous example of how many training patterns were needed to
ensure PAC learnability of a linearly separable function if n = 50, ε = 0.01, and
δ = 0.01, we obtained m ≥ 173, 748. Using the Blumer, et al. result we would
get m ≥ 52, 756.
As another example of the second theorem, let us take H to be the set of
closed intervals on the real line. The VC dimension is 2 (as shown previously).
With n = 50, ε = 0.01, and δ = 0.01, m ≥ 16, 551 ensures PAC learnability.
There is also a theorem that gives a lower (necessary) bound on the number
of training patterns required for PAC learning [Ehrenfeucht, et al., 1988]:
Theorem 8.5 Any PAC learning algorithm must examine at least
Ω(1/ε lg(1/δ) + VCdim(H)) training patterns.
The diﬀerence between the lower and upper bounds is
O(log(1/ε)VCdim(H)/ε).
8.5 Bibliographical and Historical Remarks
To be added.
Chapter 9
Unsupervised Learning
9.1 What is Unsupervised Learning?
Consider the various sets of points in a twodimensional space illustrated in Fig.
9.1. The ﬁrst set (a) seems naturally partitionable into two classes, while the
second (b) seems diﬃcult to partition at all, and the third (c) is problematic.
Unsupervised learning uses procedures that attempt to ﬁnd natural partitions
of patterns. There are two stages:
• Form an Rway partition of a set Ξ of unlabeled training patterns (where
the value of R, itself, may need to be induced from the patterns). The
partition separates Ξ into R mutually exclusive and exhaustive subsets,
Ξ
1
, . . . , Ξ
R
, called clusters.
• Design a classiﬁer based on the labels assigned to the training patterns by
the partition.
We will explain shortly various methods for deciding how many clusters there
should be and for separating a set of patterns into that many clusters. We can
base some of these methods, and their motivation, on minimumdescription
length (MDL) principles. In that setting, we assume that we want to encode
a description of a set of points, Ξ, into a message of minimal length. One
encoding involves a description of each point separately; other, perhaps shorter,
encodings might involve a description of clusters of points together with how
each point in a cluster can be described given the cluster it belongs to. The
speciﬁc techniques described in this chapter do not explicitly make use of MDL
principles, but the MDL method has been applied with success. One of the
MDLbased methods, Autoclass II [Cheeseman, et al., 1988] discovered a new
classiﬁcation of stars based on the properties of infrared sources.
Another type of unsupervised learning involves ﬁnding hierarchies of par
titionings or clusters of clusters. A hierarchical partition is one in which Ξ is
119
120 CHAPTER 9. UNSUPERVISED LEARNING
a) two clusters
b) one cluster
c) ?
Figure 9.1: Unlabeled Patterns
divided into mutually exclusive and exhaustive subsets, Ξ
1
, . . . , Ξ
R
; each set,
Ξ
i
, (i = 1, . . . , R) is divided into mutually exclusive and exhaustive subsets,
and so on. We show an example of such a hierarchical partition in Fig. 9.2.
The hierarchical form is best displayed as a tree, as shown in Fig. 9.3. The tip
nodes of the tree can further be expanded into their individual pattern elements.
One application of such hierarchical partitions is in organizing individuals into
taxonomic hierarchies such as those used in botany and zoology.
9.2 Clustering Methods
9.2.1 A Method Based on Euclidean Distance
Most of the unsupervised learning methods use a measure of similarity between
patterns in order to group them into clusters. The simplest of these involves
deﬁning a distance between patterns. For patterns whose features are numeric,
the distance measure can be ordinary Euclidean distance between two points in
an ndimensional space.
There is a simple, iterative clustering method based on distance. It can
be described as follows. Suppose we have R randomly chosen cluster seekers,
C
1
, . . . , C
R
. These are points in an ndimensional space that we want to adjust
so that they each move toward the center of one of the clusters of patterns.
We present the (unlabeled) patterns in the training set, Ξ, to the algorithm
9.2. CLUSTERING METHODS 121
U
11
U
12
U
21
U
22
U
23
U
31
U
32
U
11
F
U
12
= U
1
U
21
F
U
22
F
U
23
= U
2
U
31
F U
32
= U
3
U
1
F U
2
F U
3
= U
Figure 9.2: A Hierarchy of Clusters
onebyone. For each pattern, X
i
, presented, we ﬁnd that cluster seeker, C
j
,
that is closest to X
i
and move it closer to X
i
:
C
j
←− (1 −α
j
)C
j
+α
j
X
i
where α
j
is a learning rate parameter for the jth cluster seeker; it determines
how far C
j
is moved toward X
i
.
Reﬁnements on this procedure make the cluster seekers move less far as
training proceeds. Suppose each cluster seeker, C
j
, has a mass, m
j
, equal to
the number of times that it has moved. As a cluster seeker’s mass increases it
moves less far towards a pattern. For example, we might set α
j
= 1/(1 + m
j
)
and use the above rule together with m
j
←− m
j
+1. With this adjustment rule,
a cluster seeker is always at the center of gravity (sample mean) of the set of
patterns toward which it has so far moved. Intuitively, if a cluster seeker ever
gets within some reasonably well clustered set of patterns (and if that cluster
seeker is the only one so located), it will converge to the center of gravity of
that cluster.
122 CHAPTER 9. UNSUPERVISED LEARNING
U
U
2
U
11
U
12
U
31
U
32
U
21
U
22
U
23
U
1 U
3
Figure 9.3: Displaying a Hierarchy as a Tree
Once the cluster seekers have converged, the classiﬁer implied by the now
labeled patterns in Ξ can be based on a Voronoi partitioning of the space (based
on distances to the various cluster seekers). This kind of classiﬁcation, an ex
ample of which is shown in Fig. 9.4, can be implemented by a linear machine.
Georgy Fedoseevich Voronoi, was a
Russian mathematician who lived
from 1868 to 1909. When basing partitioning on distance, we seek clusters whose patterns are
as close together as possible. We can measure the badness, V , of a cluster of
patterns, {X
i
}, by computing its sample variance deﬁned by:
V = (1/K)
¸
i
(X
i
−M)
2
where M is the sample mean of the cluster, which is deﬁned to be:
M = (1/K)
¸
i
X
i
and K is the number of points in the cluster.
We would like to partition a set of patterns into clusters such that the sum of
the sample variances (badnesses) of these clusters is small. Of course if we have
one cluster for each pattern, the sample variances will all be zero, so we must
arrange that our measure of the badness of a partition must increase with the
number of clusters. In this way, we can seek a tradeoﬀ between the variances of
9.2. CLUSTERING METHODS 123
C
1
C
2
C
3
Separating boundaries
Figure 9.4: MinimumDistance Classiﬁcation
the clusters and the number of them in a way somewhat similar to the principle
of minimal description length discussed earlier.
Elaborations of our basic clusterseeking procedure allow the number of clus
ter seekers to vary depending on the distances between them and depending on
the sample variances of the clusters. For example, if the distance, d
ij
, between
two cluster seekers, C
i
and C
j
, ever falls below some threshold ε, then we can
replace them both by a single cluster seeker placed at their center of gravity
(taking into account their respective masses). In this way we can decrease the
overall badness of a partition by reducing the number of clusters for compara
tively little penalty in increased variance.
On the other hand, if any of the cluster seekers, say C
i
, deﬁnes a cluster
whose sample variance is larger than some amount δ, then we can place a new
cluster seeker, C
j
, at some random location somewhat adjacent to C
i
and reset
the masses of both C
i
and C
j
to zero. In this way the badness of the par
tition might ultimately decrease by decreasing the total sample variance with
comparatively little penalty for the additional cluster seeker. The values of the
parameters ε and δ are set depending on the relative weights given to sample
variances and numbers of clusters.
In distancebased methods, it is important to scale the components of the
pattern vectors. The variation of values along some dimensions of the pattern
vector may be much diﬀerent than that of other dimensions. One commonly
used technique is to compute the standard deviation (i.e., the square root of the
variance) of each of the components over the entire training set and normalize
the values of the components so that their adjusted standard deviations are
equal.
124 CHAPTER 9. UNSUPERVISED LEARNING
9.2.2 A Method Based on Probabilities
Suppose we have a partition of the training set, Ξ, into R mutually exclusive
and exhaustive clusters, C
1
, . . . , C
R
. We can decide to which of these clusters
some arbitrary pattern, X, should be assigned by selecting the C
i
for which
the probability, p(C
i
X), is largest, providing p(C
i
X) is larger than some ﬁxed
threshold, δ. As we saw earlier, we can use Bayes rule and base our decision on
maximizing p(XC
i
)p(C
i
). Assuming conditional independence of the pattern
components, x
i
, the quantity to be maximized is:
S(X, C
i
) = p(x
1
C
i
)p(x
2
C
i
) · · · p(x
n
C
i
)p(C
i
)
The p(x
j
C
i
) can be estimated from the sample statistics of the patterns in the
clusters and then used in the above expression. (Recall the linear form that this
formula took in the case of binaryvalued components.)
We call S(X, C
i
) the similarity of X to a cluster, C
i
, of patterns. Thus, we
assign X to the cluster to which it is most similar, providing the similarity is
larger than δ.
Just as before, we can deﬁne the sample mean of a cluster, C
i
, to be:
M
i
= (1/K
i
)
¸
Xj Ci
X
j
where K
i
is the number of patterns in C
i
.
We can base an iterative clustering algorithm on this measure of similarity
[Mahadevan & Connell, 1992]. It can be described as follows:
a. Begin with a set of unlabeled patterns Ξ and an empty list, L, of clusters.
b. For the next pattern, X, in Ξ, compute S(X, C
i
) for each cluster, C
i
.
(Initially, these similarities are all zero.) Suppose the largest of these
similarities is S(X, C
max
).
(a) If S(X, C
max
) > δ, assign X to C
max
. That is,
C
max
←− C
max
∪ {X}
Update the sample statistics p(x
1
C
max
), p(x
2
C
max
), . . . , p(x
n
C
max
),
and p(C
max
) to take the new pattern into account. Go to 3.
(b) If S(X, C
max
) ≤ δ, create a new cluster, C
new
= {X} and add C
new
to L. Go to 3.
c. Merge any existing clusters, C
i
and C
j
if (M
i
− M
j
)
2
< ε. Compute
new sample statistics p(x
1
C
merge
), p(x
2
C
merge
), . . . , p(x
n
C
merge
), and
p(C
merge
) for the merged cluster, C
merge
= C
i
∪ C
j
.
9.3. HIERARCHICAL CLUSTERING METHODS 125
d. If the sample statistics of the clusters have not changed during an entire
iteration through Ξ, then terminate with the clusters in L; otherwise go
to 2.
The value of the parameter δ controls the number of clusters. If δ is high,
there will be a large number of clusters with few patterns in each cluster. For
small values of δ, there will be a small number of clusters with many patterns in
each cluster. Similarly, the larger the value of ε, the smaller the number clusters
that will be found.
Designing a classiﬁer based on the patterns labeled by the partitioning is
straightforward. We assign any pattern, X, to that category that maximizes
S(X, C
i
). Mention “kmeans and “EM”
methods.
9.3 Hierarchical Clustering Methods
9.3.1 A Method Based on Euclidean Distance
Suppose we have a set, Ξ, of unlabeled training patterns. We can form a hi
erarchical classiﬁcation of the patterns in Ξ by a simple agglomerative method.
(The description of this algorithm is based on an unpublished manuscript by
Pat Langley.) Our description here gives the general idea; we leave it to the
reader to generate a precise algorithm.
We ﬁrst compute the Euclidean distance between all pairs of patterns in Ξ.
(Again, appropriate scaling of the dimensions is assumed.) Suppose the smallest
distance is between patterns X
i
and X
j
. We collect X
i
and X
j
into a cluster,
C, eliminate X
i
and X
j
from Ξ and replace them by a cluster vector, C, equal
to the average of X
i
and X
j
. Next we compute the Euclidean distance again
between all pairs of points in Ξ. If the smallest distance is between pairs of
patterns, we form a new cluster, C, as before and replace the pair of patterns
in Ξ by their average. If the shortest distance is between a pattern, X
i
, and
a cluster vector, C
j
(representing a cluster, C
j
), we form a new cluster, C,
consisting of the union of C
j
and {X
i
}. In this case, we replace C
j
and X
i
in Ξ by their (appropriately weighted) average and continue. If the shortest
distance is between two cluster vectors, C
i
and C
j
, we form a new cluster, C,
consisting of the union of C
i
and C
j
. In this case, we replace C
i
and C
j
by their
(appropriately weighted) average and continue. Since we reduce the number of
points in Ξ by one each time, we ultimately terminate with a tree of clusters
rooted in the cluster containing all of the points in the original training set.
An example of how this method aggregates a set of two dimensional patterns
is shown in Fig. 9.5. The numbers associated with each cluster indicate the order
in which they were formed. These clusters can be organized hierarchically in a
binary tree with cluster 9 as root, clusters 7 and 8 as the two descendants of the
root, and so on. A ternary tree could be formed instead if one searches for the
three points in Ξ whose triangle deﬁned by those patterns has minimal area.
126 CHAPTER 9. UNSUPERVISED LEARNING
1
2
3
5
4
6
7
8
9
Figure 9.5: Agglommerative Clustering
9.3.2 A Method Based on Probabilities
A probabilistic quality measure for partitions
We can develop a measure of the goodness of a partitioning based on how
accurately we can guess a pattern given only what partition it is in. Suppose
we are given a partitioning of Ξ into R classes, C
1
, . . . , C
R
. As before, we can
compute the sample statistics p(x
i
C
k
) which give probability values for each
component given the class assigned to it by the partitioning. Suppose each
component x
i
of X can take on the values v
ij
, where the index j steps over the
domain of that component. We use the notation p
i
(v
ij
C
k
) = probability(x
i
=
v
ij
C
k
).
Suppose we use the following probabilistic guessing rule about the values
of the components of a vector X given only that it is in class k. Guess that
x
i
= v
ij
with probability p
i
(v
ij
C
k
). Then, the probability that we guess the
ith component correctly is:
¸
j
probability(guess is v
ij
)p
i
(v
ij
C
k
) =
¸
j
[p
i
(v
ij
C
k
)]
2
The average number of (the n) components whose values are guessed correctly
by this method is then given by the sum of these probabilities over all of the
components of X:
¸
i
¸
j
[p
i
(v
ij
C
k
)]
2
9.3. HIERARCHICAL CLUSTERING METHODS 127
Given our partitioning into R classes, the goodness measure, G, of this parti
tioning is the average of the above expression over all classes:
G =
¸
k
p(C
k
)
¸
i
¸
j
[p
i
(v
ij
C
k
)]
2
where p(C
k
) is the probability that a pattern is in class C
k
. In order to penalize
this measure for having a large number of classes, we divide it by R to get an
overall “quality” measure of a partitioning:
Z = (1/R)
¸
k
p(C
k
)
¸
i
¸
j
[p
i
(v
ij
C
k
)]
2
We give an example of the use of this measure for a trivially simple
clustering of the four threedimensional patterns shown in Fig. 9.6. There
are several diﬀerent partitionings. Let’s evaluate Z values for the follow
ing ones: P
1
= {a, b, c, d}, P
2
= {{a, b}, {c, d}}, P
3
= {{a, c}, {b, d}}, and
P
4
= {{a}, {b}, {c}, {d}}. The ﬁrst, P
1
, puts all of the patterns into a single
cluster. The sample probabilities p
i
(v
i1
= 1) and p
i
(v
i0
= 0) are all equal to 1/2
for each of the three components. Summing over the values of the components
(0 and 1) gives (1/2)
2
+ (1/2)
2
= 1/2. Summing over the three components
gives 3/2. Averaging over all of the clusters (there is just one) also gives 3/2.
Finally, dividing by the number of clusters produces the ﬁnal Z value of this
partition, Z(P
1
) = 3/2.
The second partition, P
2
, gives the following sample probabilities:
p
1
(v
11
= 1C
1
) = 1
p
2
(v
21
= 1C
1
) = 1/2
p
3
(v
31
= 1C
1
) = 1
Summing over the values of the components (0 and 1) gives (1)
2
+ (0)
2
= 1 for
component 1, (1/2)
2
+ (1/2)
2
= 1/2 for component 2, and (1)
2
+ (0)
2
= 1 for
component 3. Summing over the three components gives 2 1/2 for class 1. A
similar calculation also gives 2 1/2 for class 2. Averaging over the two clusters
also gives 2 1/2. Finally, dividing by the number of clusters produces the ﬁnal
Z value of this partition, Z(P
2
) = 1 1/4, not quite as high as Z(P
1
).
Similar calculations yield Z(P
3
) = 1 and Z(P
4
) = 3/4, so this method of
evaluating partitions would favor placing all patterns in a single cluster.
128 CHAPTER 9. UNSUPERVISED LEARNING
x
2
x
3
x
1
a
b
c
d
Figure 9.6: Patterns in 3Dimensional Space
An iterative method for hierarchical clustering
Evaluating all partitionings of m patterns and then selecting the best would be
computationally intractable. The following iterative method is based on a hi
erarchical clustering procedure called COBWEB [Fisher, 1987]. The procedure
grows a tree each node of which is labeled by a set of patterns. At the end
of the process, the root node contains all of the patterns in Ξ. The successors
of the root node will contain mutually exclusive and exhaustive subsets of Ξ.
In general, the successors of a node, η, are labeled by mutually exclusive and
exhaustive subsets of the pattern set labelling node η. The tips of the tree will
contain singleton sets. The method uses Z values to place patterns at the vari
ous nodes; sample statistics are used to update the Z values whenever a pattern
is placed at a node. The algorithm is as follows:
a. We start with a tree whose root node contains all of the patterns in Ξ
and a single empty successor node. We arrange that at all times dur
ing the process every nonempty node in the tree has (besides any other
successors) exactly one empty successor.
b. Select a pattern X
i
in Ξ (if there are no more patterns to select, terminate).
c. Set µ to the root node.
d. For each of the successors of µ (including the empty successor!), calculate
the best host for X
i
. A best host is determined by tentatively placing
X
i
in one of the successors and calculating the resulting Z value for each
9.3. HIERARCHICAL CLUSTERING METHODS 129
one of these ways of accomodating X
i
. The best host corresponds to the
assignment with the highest Z value.
e. If the best host is an empty node, η, we place X
i
in η, generate an empty
successor node of η, generate an empty sibling node of η, and go to 2.
f. If the best host is a nonempty, singleton (tip) node, η, we place X
i
in η,
create one successor node of η containing the singleton pattern that was
in η, create another successor node of η containing X
i
, create an empty
successor node of η, create empty successor nodes of the new nonempty
successors of η, and go to 2.
g. If the best host is a nonempty, nonsingleton node, η, we place X
i
in η,
set µ to η, and go to 4.
This process is rather sensitive to the order in which patterns are presented.
To make the ﬁnal classiﬁcation tree less order dependent, the COBWEB proce
dure incorporates node merging and splitting.
Node merging:
It may happen that two nodes having the same parent could be merged with
an overall increase in the quality of the resulting classiﬁcation performed by the
successors of that parent. Rather than try all pairs to merge, a good heuristic
is to attempt to merge the two best hosts. When such a merging improves the
Z value, a new node containing the union of the patterns in the merged nodes
replaces the merged nodes, and the two nodes that were merged are installed
as successors of the new node.
Node splitting:
A heuristic for node splitting is to consider replacing the best host among a
group of siblings by that host’s successors. This operation is performed only if
it increases the Z value of the classiﬁcation performed by a group of siblings.
Example results from COBWEB
We mention two experiments with COBWEB. In the ﬁrst, the program at
tempted to ﬁnd two categories (we will call them Class 1 and Class 2) of United
States Senators based on their votes (yes or no) on six issues. After the clus
ters were established, the majority vote in each class was computed. These are
shown in the table below.
Issue Class 1 Class 2
Toxic Waste yes no
Budget Cuts yes no
SDI Reduction no yes
Contra Aid yes no
LineItem Veto yes no
MX Production yes no
130 CHAPTER 9. UNSUPERVISED LEARNING
In the second experiment, the program attempted to classify soybean dis
eases based on various characteristics. COBWEB grouped the diseases in the
taxonomy shown in Fig. 9.7.
N
0
soybean
diseases
N
1
Diaporthe
Stem Canker
N
2
Charcoal
Rot
N
3
N
31
Rhizoctonia
Rot
N
32
Phytophthora
Rot
Figure 9.7: Taxonomy Induced for Soybean Diseases
9.4 Bibliographical and Historical Remarks
To be added.
Chapter 10
TemporalDiﬀerence
Learning
10.1 Temporal Patterns and Prediction Prob
lems
In this chapter, we consider problems in which we wish to learn to predict the
future value of some quantity, say z, from an ndimensional input pattern, X.
In many of these problems, the patterns occur in temporal sequence, X
1
, X
2
,
. . ., X
i
, X
i+1
, . . ., X
m
, and are generated by a dynamical process. The
components of X
i
are features whose values are available at time, t = i. We
distinguish two kinds of prediction problems. In one, we desire to predict the
value of z at time t = i + 1 based on input X
i
for every i. For example, we
might wish to predict some aspects of tomorrow’s weather based on a set of
measurements made today. In the other kind of prediction problem, we desire
to make a sequence of predictions about the value of z at some ﬁxed time, say
t = m + 1, based on each of the X
i
, i = 1, . . . , m. For example, we might wish
to make a series of predictions about some aspect of the weather on next New
Year’s Day, based on measurements taken every day before New Year’s. Sutton
[Sutton, 1988] has called this latter problem, multistep prediction, and that is
the problem we consider here. In multistep prediction, we might expect that
the prediction accuracy should get better and better as i increases toward m.
10.2 Supervised and TemporalDiﬀerence Meth
ods
A training method that naturally suggests itself is to use the actual value of
z at time m + 1 (once it is known) in a supervised learning procedure using a
131
132 CHAPTER 10. TEMPORALDIFFERENCE LEARNING
sequence of training patterns, {X
1
, X
2
, . . ., X
i
, X
i+1
, . . ., X
m
}. That is, we
seek to learn a function, f, such that f(X
i
) is as close as possible to z for each i.
Typically, we would need a training set, Ξ, consisting of several such sequences.
We will show that a method that is better than supervised learning for some
important problems is to base learning on the diﬀerence between f(X
i+1
) and
f(X
i
) rather than on the diﬀerence between z and f(X
i
). Such methods involve
what is called temporaldiﬀerence (TD) learning.
We assume that our prediction, f(X), depends on a vector of modiﬁable
weights, W. To make that dependence explicit, we write f(X, W). For su
pervised learning, we consider procedures of the following type: For each X
i
,
the prediction f(X
i
, W) is computed and compared to z, and the learning rule
(whatever it is) computes the change, (∆W
i
), to be made to W. Then, taking
into account the weight changes for each pattern in a sequence all at once after
having made all of the predictions with the old weight vector, we change W as
follows:
W←− W+
m
¸
i=1
(∆W)
i
Whenever we are attempting to minimize the squared error between z and
f(X
i
, W) by gradient descent, the weightchanging rule for each pattern is:
(∆W)
i
= c(z −f
i
)
∂f
i
∂W
where c is a learning rate parameter, f
i
is our prediction of z, f(X
i
, W),
at time t = i, and
∂fi
∂W
is, by deﬁnition, the vector of partial derivatives
(
∂fi
∂w1
, . . . ,
∂fi
∂wi
, . . . ,
∂fi
∂wn
) in which the w
i
are the individual components of W.
(The expression
∂fi
∂W
is sometimes written ∇
W
f
i
.) The reader will recall that
we used an equivalent expression for (∆W)
i
in deriving the backpropagation
formulas used in training multilayer neural networks.
The WidrowHoﬀ rule results when f(X, W) = X• W. Then:
(∆W)
i
= c(z −f
i
)X
i
An interesting form for (∆W)
i
can be developed if we note that
(z −f
i
) =
m
¸
k=i
(f
k+1
−f
k
)
where we deﬁne f
m+1
= z. Substituting in our formula for (∆W)
i
yields:
(∆W)
i
= c(z −f
i
)
∂f
i
∂W
10.2. SUPERVISED AND TEMPORALDIFFERENCE METHODS 133
= c
∂f
i
∂W
m
¸
k=i
(f
k+1
−f
k
)
In this form, instead of using the diﬀerence between a prediction and the value
of z, we use the diﬀerences between successive predictions—thus the phrase
temporaldiﬀerence (TD) learning.
In the case when f(X, W) = X • W, the temporal diﬀerence form of the
WidrowHoﬀ rule is:
(∆W)
i
= cX
i
m
¸
k=i
(f
k+1
−f
k
)
One reason for writing (∆W)
i
in temporaldiﬀerence form is to permit an
interesting generalization as follows:
(∆W)
i
= c
∂f
i
∂W
m
¸
k=i
λ
(k−i)
(f
k+1
−f
k
)
where 0 < λ ≤ 1. Here, the λ term gives exponentially decreasing weight to
diﬀerences later in time than t = i. When λ = 1, we have the same rule with
which we began—weighting all diﬀerences equally, but as λ →0, we weight only
the (f
i+1
−f
i
) diﬀerence. With the λ term, the method is called TD(λ).
It is interesting to compare the two extreme cases:
For TD(0):
(∆W)
i
= c(f
i+1
−f
i
)
∂f
i
∂W
For TD(1):
(∆W)
i
= c(z −f
i
)
∂f
i
∂W
Both extremes can be handled by the same learning mechanism; only the error
term is diﬀerent. In TD(0), the error is the diﬀerence between successive predic
tions, and in TD(1), the error is the diﬀerence between the ﬁnally revealed value
of z and the prediction. Intermediate values of λ take into account diﬀerently
weighted diﬀerences between future pairs of successive predictions.
Only TD(1) can be considered a pure supervised learning procedure, sensitive
to the ﬁnal value of z provided by the teacher. For λ < 1, we have various degrees
of unsupervised learning, in which the prediction function strives to make each
prediction more like successive ones (whatever they might be). We shall soon
see that these unsupervised procedures result in better learning than do the
supervised ones for an important class of problems.
134 CHAPTER 10. TEMPORALDIFFERENCE LEARNING
10.3 Incremental Computation of the (∆W)
i
We can rewrite our formula for (∆W)
i
, namely
(∆W)
i
= c
∂f
i
∂W
m
¸
k=i
λ
(k−i)
(f
k+1
−f
k
)
to allow a type of incremental computation. First we write the expression for
the weight change rule that takes into account all of the (∆W)
i
:
W←− W+
m
¸
i=1
c
∂f
i
∂W
m
¸
k=i
λ
(k−i)
(f
k+1
−f
k
)
Interchanging the order of the summations yields:
W←− W+
m
¸
k=1
c
k
¸
i=1
λ
(k−i)
(f
k+1
−f
k
)
∂f
i
∂W
= W+
m
¸
k=1
c(f
k+1
−f
k
)
k
¸
i=1
λ
(k−i)
∂f
i
∂W
Interchanging the indices k and i ﬁnally yields:
W←− W+
m
¸
i=1
c(f
i+1
−f
i
)
i
¸
k=1
λ
(i−k)
∂f
k
∂W
If, as earlier, we want to use an expression of the form W←− W+
¸
m
i=1
(∆W)
i
,
we see that we can write:
(∆W)
i
= c(f
i+1
−f
i
)
i
¸
k=1
λ
(i−k)
∂f
k
∂W
Now, if we let e
i
=
¸
i
k=1
λ
(i−k) ∂f
k
∂W
, we can develop a computationally eﬃcient
recurrence equation for e
i+1
as follows:
e
i+1
=
i+1
¸
k=1
λ
(i+1−k)
∂f
k
∂W
=
∂f
i+1
∂W
+
i
¸
k=1
λ
(i+1−k)
∂f
k
∂W
10.4. AN EXPERIMENT WITH TD METHODS 135
=
∂f
i+1
∂W
+λe
i
Rewriting (∆W)
i
in these terms, we obtain:
(∆W)
i
= c(f
i+1
−f
i
)e
i
where:
e
1
=
∂f
1
∂W
e
2
=
∂f
2
∂W
+λe
1
etc.
Quoting Sutton [Sutton, 1988, page 15] (about a diﬀerent equation, but the
quote applies equally well to this one):
“. . . this equation can be computed incrementally, because each
(∆W)
i
depends only on a pair of successive predictions and on the
[weighted] sum of all past values for
∂fi
∂W
. This saves substantially on
memory, because it is no longer necessary to individually remember
all past values of
∂fi
∂W
.”
10.4 An Experiment with TD Methods
TD prediction methods [especially TD(0)] are well suited to situations in which
the patterns are generated by a dynamic process. In that case, sequences of
temporally presented patterns contain important information that is ignored
by a conventional supervised method such as the WidrowHoﬀ rule. Sutton
[Sutton, 1988, page 19] gives an interesting example involving a random walk,
which we repeat here. In Fig. 10.1, sequences of vectors, X, are generated as
follows: We start with vector X
D
; the next vector in the sequence is equally
likely to be one of the adjacent vectors in the diagram. If the next vector is
X
C
(or X
E
), the next one after that is equally likely to be one of the vectors
adjacent to X
C
(or X
E
). When X
B
is in the sequence, it is equally likely that
the sequence terminates with z = 0 or that the next vector is X
C
. Similarly,
when X
F
is in the sequence, it is equally likely that the sequence terminates
with z = 1 or that the next vector is X
E
. Thus the sequences are random, but
they always start with X
D
. Some sample sequences are shown in the ﬁgure.
136 CHAPTER 10. TEMPORALDIFFERENCE LEARNING
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
z = 0
z = 1
X
B
X
C
X
D
X
E
X
F
Typical Sequences:
X
D
X
C
X
D
X
E
X
F
1
X
D
X
C
X
B
X
C
X
D
X
E
X
D
X
E
X
F
1
X
D
X
E
X
D
X
C
X
B
0
Figure 10.1: A Markov Process
This random walk is an example of a Markov process; transitions from state i
to state j occur with probabilities that depend only on i and j.
Given a set of sequences generated by this process as a training set, we want
to be able to predict the value of z for each X in a test sequence. We assume
that the learning system does not know the transition probabilities.
For his experiments with this process, Sutton used a linear predictor, that
is f(X, W) = X• W. The learning problem is to ﬁnd a weight vector, W, that
minimizes the meansquared error between z and the predicted value of z. Given
the ﬁve diﬀerent values that X can take on, we have the following predictions:
f(X
B
) = w
1
, f(X
C
) = w
2
, f(X
D
) = w
3
, f(X
E
) = w
4
, f(X
F
) = w
5
, where
w
i
is the ith component of the weight vector. (Note that the values of the
predictions are not limited to 1 or 0—even though z can only have one of
those values—because we are minimizing meansquared error.) After training,
these predictions will be compared with the optimal ones—given the transition
probabilities.
The experimental setup was as follows: ten random sequences were generated
using the transition probabilities. Each of these sequences was presented in turn
to a TD(λ) method for various values of λ. Weight vector increments, (∆W)
i
,
were computed after each pattern presentation but no weight changes were
made until all ten sequences were presented. The weight vector increments were
summed after all ten sequences were presented, and this sum was used to change
the weight vector to be used for the next pass through the ten sequences. This
process was repeated over and over (using the same training sequences) until
(quoting Sutton) “the procedure no longer produced any signiﬁcant changes in
the weight vector. For small c, the weight vector always converged in this way,
10.4. AN EXPERIMENT WITH TD METHODS 137
and always to the same ﬁnal value [for 100 diﬀerent training sets of ten random
sequences], independent of its initial value.” (Even though, for ﬁxed, small c,
the weight vector always converged to the same vector, it might converge to a
somewhat diﬀerent vector for diﬀerent values of c.)
After convergence, the predictions made by the ﬁnal weight vector are com
pared with the optimal predictions made using the transition probabilities.
These optimal predictions are simply p(z = 1X). We can compute these proba
bilities to be 1/6, 1/3, 1/2, 2/3, and 5/6 for X
B
, X
C
, X
D
, X
E
, X
F
, respectively.
The rootmeansquared diﬀerences between the best learned predictions (over
all c) and these optimal ones are plotted in Fig. 10.2 for seven diﬀerent values
of λ. (For each data point, the standard error is approximately σ = 0.01.)
0.10
0.12
0.14
0.16
0.18
0.20
0.0 0.1 0.3 0.5 0.7 0.9 1.0
h
Error using
best c
WidrowHoff
TD(1)
TD(0)
(Adapted from Sutton, p. 20, 1988)
Figure 10.2: Prediction Errors for TD(λ)
Notice that the WidrowHoﬀ procedure does not perform as well as other
versions of TD(λ) for λ < 1! Quoting [Sutton, 1988, page 21]:
“This result contradicts conventional wisdom. It is well known that,
under repeated presentations, the WidrowHoﬀ procedure minimizes
the RMS error between its predictions and the actual outcomes in
the training set ([Widrow & Stearns, 1985]). How can it be that this
optimal method peformed worse than all the TD methods for λ <
1? The answer is that the WidrowHoﬀ procedure only minimizes
error on the training set; it does not necessarily minimize error for
future experience. [Later] we prove that in fact it is linear TD(0)
that converges to what can be considered the optimal estimates for
138 CHAPTER 10. TEMPORALDIFFERENCE LEARNING
matching future experience—those consistent with the maximum
likelihood estimate of the underlying Markov process.”
10.5 Theoretical Results
It is possible to analyze the performance of the linearprediction TD(λ) methods
on Markov processes. We state some theorems here without proof.
Theorem 10.1 (Sutton, page 24, 1988) For any absorbing Markov chain,
and for any linearly independent set of observation vectors {X
i
} for the non
terminal states, there exists an ε > 0 such that for all positive c < ε and for any
initial weight vector, the predictions of linear TD(0) (with weight updates after
each sequence) converge in expected value to the optimal (maximum likelihood)
predictions of the true process.
Even though the expected values of the predictions converge, the predictions
themselves do not converge but vary around their expected values depending on
their most recent experience. Sutton conjectures that if c is made to approach
0 as training progresses, the variance of the predictions will approach 0 also.
Dayan [Dayan, 1992] has extended the result of Theorem 9.1 to TD(λ) for
arbitrary λ between 0 and 1. (Also see [Dayan & Sejnowski, 1994].)
10.6 IntraSequence Weight Updating
Our standard weight updating rule for TD(λ) methods is:
W←− W+
m
¸
i=1
c(f
i+1
−f
i
)
i
¸
k=1
λ
(i−k)
∂f
k
∂W
where the weight update occurs after an entire sequence is observed. To make
the method truly incremental (in analogy with weight updating rules for neural
nets), it would be desirable to change the weight vector after every pattern
presentation. The obvious extension is:
W
i+1
←− W
i
+c(f
i+1
−f
i
)
i
¸
k=1
λ
(i−k)
∂f
k
∂W
where f
i+1
is computed before making the weight change; that is, f
i+1
=
f(X
i+1
, W
i
). But that would make f
i
= f(X
i
, W
i−1
), and such a rule would
make the prediction diﬀerence, namely (f
i+1
−f
i
), sensitive both to changes in
X and changes in W and could lead to instabilities. Instead, we modify the rule
so that, for every pair of predictions, f
i+1
= f(X
i+1
, W
i
) and f
i
= f(X
i
, W
i
).
This version of the rule has been used in practice with excellent results.
10.6. INTRASEQUENCE WEIGHT UPDATING 139
For TD(0) and linear predictors, the rule is:
W
i+1
= W
i
+c(f
i+1
−f
i
)X
i
The rule is implemented as follows:
a. Initialize the weight vector, W, arbitrarily.
b. For i = 1, ..., m, do:
(a) f
i
←− X
i
• W
(We compute f
i
anew each time through rather than use the value
of f
i+1
the previous time through.)
(b) f
i+1
←− X
i+1
• W
(c) d
i+1
←− f
i+1
−f
i
(d) W←− W+c d
i+1
X
i
(If f
i
were computed again with this changed weight vector, its value
would be closer to f
i+1
as desired.)
The linear TD(0) method can be regarded as a technique for training a
very simple network consisting of a single dot product unit (and no threshold
or sigmoid function). TD methods can also be used in combination with back
propagation to train neural networks. For TD(0) we change the network weights
according to the expression:
W
i+1
= W
i
+c(f
i+1
−f
i
)
∂f
i
∂W
The only change that must be made to the standard backpropagation weight
changing rule is that the diﬀerence term between the desired output and the
output of the unit in the ﬁnal (kth) layer, namely (d −f
(k)
), must be replaced
by a diﬀerence term between successive outputs, (f
i+1
−f
i
). This change has a
direct eﬀect only on the expression for δ
(k)
which becomes:
δ
(k)
= 2(f
(k)
−f
(k)
)f
(k)
(1 −f
(k)
)
where f
(k)
and f
(k)
are two successive outputs of the network.
The weight changing rule for the ith weight vector in the jth layer of weights
has the same form as before, namely:
W
(j)
i
←− W
(j)
i
+cδ
(j)
i
X
(j−1)
where the δ
(j)
i
are given recursively by:
δ
(j)
i
= f
(j)
i
(1 −f
(j)
i
)
mj+1
¸
l=1
δ
(j+1)
l
w
(j+1)
il
and w
(j+1)
il
is the lth component of the ith weight vector in the (j +1)th layer
of weights. Of course, here also it is assumed that f
(k)
and f
(k)
are computed
using the same weights and then the weights are changed. In the next section
we shall see an interesting example of this application of TD learning.
140 CHAPTER 10. TEMPORALDIFFERENCE LEARNING
10.7 An Example Application: TDgammon
A program called TDgammon [Tesauro, 1992] learns to play backgammon by
training a neural network via temporaldiﬀerence methods. The structure of
the neural net, and its coding is as shown in Fig. 10.3. The network is trained
to minimize the error between actual payoﬀ and estimated payoﬀ, where the
actual payoﬀ is deﬁned to be d
f
= p
1
+2p
2
−p
3
−2p
4
, and the p
i
are the actual
probabilities of the various outcomes as deﬁned in the ﬁgure.
. . .
p
3
= pr(black wins)
p
4
= pr(black gammons)
p
1
= pr(white wins)
p
2
= pr(white gammons)
estimated payoff:
d = p
1
+ 2p
2
< p
3
< 2p
4
no. of white
on cell 1
no. on bar,
off board,
and who
moves
198 inputs
1
2
3
# > 3
. . .
up to 40 hidden units
2 x 24
cells
4 output units
hidden and output units are sigmoids
learning rate: c = 0.1; initial weights chosen
randomly between <0.5 and +0.5.
estimated probabilities:
Figure 10.3: The TDgammon Network
TDgammon learned by using the network to select that move that results
in the best predicted payoﬀ. That is, at any stage of the game some ﬁnite set of
moves is possible and these lead to the set, {X}, of new board positions. Each
member of this set is evaluated by the network, and the one with the largest
10.8. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 141
predicted payoﬀ is selected if it is white’s move (and the smallest if it is black’s).
The move is made, and the network weights are adjusted to make the predicted
payoﬀ from the original position closer to that of the resulting position.
The weight adjustment procedure combines temporaldiﬀerence (TD(λ))
learning with backpropagation. If d
t
is the network’s estimate of the payoﬀ
at time t (before a move is made), and d
t+1
is the estimate at time t + 1 (after
a move is made), the weight adjustment rule is:
∆W
t
= c(d
t+1
−d
t
)
t
¸
k=1
λ
t−k
∂d
k
∂W
where W
t
is a vector of all weights in the network at time t, and
∂d
k
∂W
is the
gradient of d
k
in this weight space. (For a layered, feedforward network, such
as that of TDgammon, the weight changes for the weight vectors in each layer
can be expressed in the usual manner.)
To make the special cases clear, recall that for TD(0), the network would be
trained so that, for all t, its output, d
t
, for input X
t
tended toward its expected
output, d
t+1
, for input X
t+1
. For TD(1), the network would be trained so that,
for all t, its output, d
t
, for input X
t
tended toward the expected ﬁnal payoﬀ,
d
f
, given that input. The latter case is the same as the WidrowHoﬀ rule.
After about 200,000 games the following results were obtained. TDgammon
(with 40 hidden units, λ = 0.7, and c = 0.1) won 66.2% of 10,000 games against
SUN Microsystems Gammontool and 55% of 10,000 games against a neural
network trained using expert moves. Commenting on a later version of TD
gammon, incorporating special features as inputs, Tesauro said: “It appears to
be the strongest program ever seen by this author.”
10.8 Bibliographical and Historical Remarks
To be added.
142 CHAPTER 10. TEMPORALDIFFERENCE LEARNING
Chapter 11
DelayedReinforcement
Learning
11.1 The General Problem
Imagine a robot that exists in an environment in which it can sense and act.
Suppose (as an extreme case) that it has no idea about the eﬀects of its actions.
That is, it doesn’t know how acting will change its sensory inputs. Along with
its sensory inputs are “rewards,” which it occasionally receives. How should it
choose its actions so as to maximize its rewards over the long run? To maximize
rewards, it will need to be able to predict how actions change inputs, and in
particular, how actions lead to rewards.
We formalize the problem in the following way: The robot exists in an
environment consisting of a set, S, of states. We assume that the robot’s sensory
apparatus constructs an input vector, X, from the environment, which informs
the robot about which state the environment is in. For the moment, we will
assume that the mapping from states to vectors is onetoone, and, in fact, will
use the notation X to refer to the state of the environment as well as to the
input vector. When presented with an input vector, the robot decides which
action from a set, A, of actions to perform. Performing the action produces an
eﬀect on the environment—moving it to a new state. The new state results in
the robot perceiving a new input vector, and the cycle repeats. We assume a
discrete time model; the input vector at time t = i is X
i
, the action taken at
that time is a
i
, and the expected reward, r
i
, received at t = i depends on the
action taken and on the state, that is r
i
= r(X
i
, a
i
). The learner’s goal is to ﬁnd
a policy, π(X), that maps input vectors to actions in such a way that maximizes
rewards accumulated over time. This type of learning is called reinforcement
learning. The learner must ﬁnd the policy by trial and error; it has no initial
knowledge of the eﬀects of its actions. The situation is as shown in Fig. 11.1.
143
144 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
X
i
r
i
Learner
Environment
(reward)
(state)
(action)
a
i
Figure 11.1: Reinforcement Learning
11.2 An Example
A “grid world,” such as the one shown in Fig. 11.2 is often used to illustrate
reinforcement learning. Imagine a robot initially in cell (2,3). The robot receives
input vector (x
1
, x
2
) telling it what cell it is in; it is capable of four actions,
n, e, s, w moving the robot one cell up, right, down, or left, respectively. It is
rewarded one negative unit whenever it bumps into the wall or into the blocked
cells. For example, if the input to the robot is (1,3), and the robot chooses
action w, the next input to the robot is still (1,3) and it receives a reward of
−1. If the robot lands in the cell marked G (for goal), it receives a reward of
+10. Let’s suppose that whenever the robot lands in the goal cell and gets its
reward, it is immediately transported out to some random cell, and the quest
for reward continues.
A policy for our robot is a speciﬁcation of what action to take for every one
of its inputs, that is, for every one of the cells in the grid. For example, a com
ponent of such a policy would be “when in cell (3,1), move right.” An optimal
policy is a policy that maximizes longterm reward. One way of displaying a
policy for our gridworld robot is by an arrow in each cell indicating the direc
tion the robot should move when in that cell. In Fig. 11.3, we show an optimal
policy displayed in this manner. In this chapter we will describe methods for
learning optimal policies based on reward values received by the learner.
11.3. TEMPORAL DISCOUNTING AND OPTIMAL POLICIES 145
R
G
1 2 3 4 5 6 7
1
2
3
4
5
6
7
8
Figure 11.2: A Grid World
11.3 Temporal Discounting and Optimal Poli
cies
In delayed reinforcement learning, one often assumes that rewards in the distant
future are not as valuable as are more immediate rewards. This preference can
be accomodated by a temporal discount factor, 0 ≤ γ < 1. The present value of
a reward, r
i
, occuring i time units in the future, is taken to be γ
i
r
i
. Suppose
we have a policy π(X) that maps input vectors into actions, and let r
π(X)
i
be
the reward that will be received on the ith time step after one begins executing
policy π starting in state X. Then the total reward accumulated over all time
steps by policy π beginning in state X is:
V
π
(X) =
∞
¸
i=0
γ
i
r
π(X)
i
One reason for using a temporal discount factor is so that the above sum will
be ﬁnite. An optimal policy is one that maximizes V
π
(X) for all inputs, X.
In general, we want to consider the case in which the rewards, r
i
, are random
variables and in which the eﬀects of actions on environmental states are random.
In Markovian environments, for example, the probability that action a in state
X
i
will lead to state X
j
is given by a transition probability p[X
j
X
i
, a]. Then,
we will want to maximize expected future reward and would deﬁne V
π
(X) as:
V
π
(X) = E
¸
∞
¸
i=0
γ
i
r
π(X)
i
¸
In either case, we call V
π
(X) the value of policy π for input X.
146 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
R
G
1 2 3 4 5 6 7
1
2
3
4
5
6
7
8
Figure 11.3: An Optimal Policy in the Grid World
If the action prescribed by π taken in state X leads to state X
(randomly
according to the transition probabilities), then we can write V
π
(X) in terms of
V
π
(X
) as follows:
V
π
(X) = r[X, π(X)] +γ
¸
X
p[X
X, π(X)]V
π
(X
)
where (in summary):
γ = the discount factor,
V
π
(X) = the value of state X under policy π,
r[X, π(X)] = the expected immediate reward received when we execute the
action prescribed by π in state X, and
p[X
X, π(X)] = the probability that the environment transitions to state
X
when we execute the action prescribed by π in state X.
In other words, the value of state X under policy π is the expected value of
the immediate reward received when executing the action recommended by π
plus the average value (under π) of all of the states accessible from X.
For an optimal policy, π
∗
(and no others!), we have the famous “optimality
equation:”
V
π
∗
(X) = max
a
¸
r(X, a) +γ
¸
X
p[X
X, a]V
π
∗
(X
)
¸
The theory of dynamic programming (DP) [Bellman, 1957, Ross, 1983] assures
us that there is at least one optimal policy, π
∗
, that satisﬁes this equation. DP
11.4. QLEARNING 147
also provides methods for calculating V
π
∗
(X) and at least one π
∗
, assuming
that we know the average rewards and the transition probabilities. If we knew
the transition probabilities, the average rewards, and V
π
∗
(X) for all X and a,
then it would be easy to implement an optimal policy. We would simply select
that a that maximizes r(X, a) +γ
¸
X
p[X
X, a]V
π
∗
(X
). That is,
π
∗
(X) = arg max
a
¸
r(X, a) +γ
¸
X
p[X
X, a]V
π
∗
(X
)
¸
But, of course, we are assuming that we do not know these average rewards nor
the transition probabilities, so we have to ﬁnd a method that eﬀectively learns
them.
If we had a model of actions, that is, if we knew for every state, X, and
action a, which state, X
resulted, then we could use a method called value
iteration to ﬁnd an optimal policy. Value iteration works as follows: We begin
by assigning, randomly, an estimated value
ˆ
V (X) to every state, X. On the ith
step of the process, suppose we are at state X
i
(that is, our input on the ith
step is X
i
), and that the estimated value of state X
i
on the ith step is
ˆ
V
i
(X
i
).
We then select that action a that maximizes the estimated value of the predicted
subsequent state. Suppose this subsequent state having the highest estimated
value is X
i
. Then we update the estimated value,
ˆ
V
i
(X
i
), of state X
i
as follows:
ˆ
V
i
(X) = (1 −c
i
)
ˆ
V
i−1
(X) +c
i
r
i
+γ
ˆ
V
i−1
(X
i
)
if X = X
i
,
=
ˆ
V
i−1
(X)
otherwise.
We see that this adjustment moves the value of
ˆ
V
i
(X
i
) an increment (depend
ing on c
i
) closer to
r
i
+γ
ˆ
V
i
(X
i
)
. Assuming that
ˆ
V
i
(X
i
) is a good estimate for
V
i
(X
i
), then this adjustment helps to make the two estimates more consistent.
Providing that 0 < c
i
< 1 and that we visit each state inﬁnitely often, this
process of value iteration will converge to the optimal values. Discuss synchronous dynamic
programming, asynchronous
dynamic programming, and policy
iteration.
11.4 QLearning
Watkins [Watkins, 1989] has proposed a technique that he calls incremental
dynamic programming. Let a; π stand for the policy that chooses action a once,
and thereafter chooses actions according to policy π. We deﬁne:
Q
π
(X, a) = V
a;π
(X)
148 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
Then the optimal value from state X is given by:
V
π
∗
(X) = max
a
Q
π
∗
(X, a)
This equation holds only for an optimal policy, π
∗
. The optimal policy is given
by:
π
∗
(X) = arg max
a
Q
π
∗
(X, a)
Note that if an action a makes Q
π
(X, a) larger than V
π
(X), then we can improve
π by changing it so that π(X) = a. Making such a change is the basis for a
powerful learning rule that we shall describe shortly.
Suppose action a in state X leads to state X
. Then using the deﬁnitions of
Q and V , it is easy to show that:
Q
π
(X, a) = r(X, a) +γE[V
π
(X
)]
where r(X, a) is the average value of the immediate reward received when we
execute action a in state X. For an optimal policy (and no others), we have
another version of the optimality equation in terms of Q values:
Q
π
∗
(X, a) = max
a
r(X, a) +γE
Q
π
∗
(X
, a)
for all actions, a, and states, X. Now, if we had the optimal Q values (for all
a and X), then we could implement an optimal policy simply by selecting that
action that maximized r(X, a) +γE
Q
π
∗
(X
, a)
.
That is,
π
∗
(X) = arg max
a
r(X, a) +γE
Q
π
∗
(X
, a)
Watkins’ proposal amounts to a TD(0) method of learning the Q values.
We quote (with minor notational changes) from [Watkins & Dayan, 1992, page
281]:
“In QLearning, the agent’s experience consists of a sequence of dis
tinct stages or episodes. In the ith episode, the agent:
• observes its current state X
i
,
• selects [using the method described below] and performs an
action a
i
,
• observes the subsequent state X
i
,
• receives an immediate reward r
i
, and
11.4. QLEARNING 149
• adjusts its Q
i−1
values using a learning factor c
i
, according to:
Q
i
(X, a) = (1 −c
i
)Q
i−1
(X, a) +c
i
[r
i
+γV
i−1
(X
i
)]
if X = X
i
and a = a
i
,
= Q
i−1
(X, a)
otherwise,
where
V
i−1
(X
) = max
b
[Q
i−1
(X
, b)]
is the best the agent thinks it can do from state X
. . . . The
initial Q values, Q
0
(X, a), for all states and actions are assumed
given.”
Using the current Q values, Q
i
(X, a), the agent always selects that action
that maximizes Q
i
(X, a). Note that only the Q value corresponding to the
state just exited and the action just taken is adjusted. And that Q value is
adjusted so that it is closer (by an amount determined by c
i
) to the sum of
the immediate reward plus the discounted maximum (over all actions) of the Q
values of the state just entered. If we imagine the Q values to be predictions of
ultimate (inﬁnite horizon) total reward, then the learning procedure described
above is exactly a TD(0) method of learning how to predict these Q values.
Q learning strengthens the usual TD methods, however, because TD (applied
to reinforcement problems using value iteration) requires a onestep lookahead,
using a model of the eﬀects of actions, whereas Q learning does not.
A convenient notation (proposed by [Schwartz, 1993]) for representing the
change in Q value is:
Q(X, a)
β
←− r +γV (X
)
where Q(X, a) is the new Q value for input X and action a, r is the immediate
reward when action a is taken in response to input X, V (X
) is the maximum
(over all actions) of the Q value of the state next reached when action a is taken
from state X, and β is the fraction of the way toward which the new Q value,
Q(X, a), is adjusted to equal r +γV (X
).
Watkins and Dayan [Watkins & Dayan, 1992] prove that, under certain con
ditions, the Q values computed by this learning procedure converge to optimal
ones (that is, to ones on which an optimal policy can be based).
We deﬁne n
i
(X, a) as the index (episode number) of the ith time that action
a is tried in state X. Then, we have:
150 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
Theorem 11.1 (Watkins and Dayan) For Markov problems with states {X}
and actions {a}, and given bounded rewards r
n
 ≤ R, learning rates 0 ≤ c
n
< 1,
and
∞
¸
i=0
c
n
i
(X,a)
= ∞,
∞
¸
i=0
c
n
i
(X,a)
2
< ∞
for all X and a, then
Q
n
(X, a) → Q
∗
n
(X, a) as n → ∞, for all X and a, with probability 1, where
Q
∗
n
(X, a) corresponds to the Q values of an optimal policy.
Again, we quote from [Watkins & Dayan, 1992, page 281]:
“The most important condition implicit in the convergence theorem
. . . is that the sequence of episodes that forms the basis of learning
must include an inﬁnite number of episodes for each starting state
and action. This may be considered a strong condition on the way
states and actions are selected—however, under the stochastic con
ditions of the theorem, no method could be guaranteed to ﬁnd an
optimal policy under weaker conditions. Note, however, that the
episodes need not form a continuous sequence—that is the X
of one
episode need not be the X of the next episode.”
The relationships among Q learning, dynamic programming, and control
are very well described in [Barto, Bradtke, & Singh, 1994]. Q learning is best
thought of as a stochastic approximation method for calculating the Q values.
Although the deﬁnition of the optimal Q values for any state depends recursively
on expected values of the Q values for subsequent states (and on the expected
values of rewards), no expected values are explicitly computed by the procedure.
Instead, these values are approximated by iterative sampling using the actual
stochastic mechanism that produces successor states.
11.5 Discussion, Limitations, and Extensions of
QLearning
11.5.1 An Illustrative Example
The Qlearning procedure requires that we maintain a table of Q(X, a) values
for all stateaction pairs. In the grid world that we described earlier, such a
table would not be excessively large. We might start with random entries in the
table; a portion of such an intial table might be as follows:
11.5. DISCUSSION, LIMITATIONS, ANDEXTENSIONS OF QLEARNING151
X a Q(X, a) r(X, a)
(2,3) w 7 0
(2,3) n 4 0
(2,3) e 3 0
(2,3) s 6 0
(1,3) w 4 1
(1,3) n 5 0
(1,3) e 2 0
(1,3) s 4 0
Suppose the robot is in cell (2,3). The maximum Q value occurs for a = w, so the
robot moves west to cell (1,3)—receiving no immediate reward. The maximum
Q value in cell (1,3) is 5, and the learning mechanism attempts to make the
value of Q((2, 3), w) closer to the discounted value of 5 plus the immediate
reward (which was 0 in this case). With a learning rate parameter c = 0.5
and γ = 0.9, the Q value of Q((2, 3), w) is adjusted from 7 to 5.75. No other
changes are made to the table at this episode. The reader might try this learning
procedure on the grid world with a simple computer program. Notice that an
optimal policy might not be discovered if some cells are not visited nor some
actions not tried frequently enough.
The learning problem faced by the agent is to associate speciﬁc actions with
speciﬁc input patterns. Q learning gradually reinforces those actions that con
tribute to positive rewards by increasing the associated Q values. Typically, as
in this example, rewards occur somewhat after the actions that lead to them—
hence the phrase delayedreinforcement learning. One can imagine that better
and better approximations to the optimal Q values gradually propagate back
from states producing rewards toward all of the other states that the agent fre
quently visits. With random Q values to begin, the agent’s actions amount to a
random walk through its space of states. Only when this random walk happens
to stumble into rewarding states does Q learning begin to produce Q values
that are useful, and, even then, the Q values have to work their way outward
from these rewarding states. The general problem of associating rewards with
stateaction pairs is called the temporal credit assignment problem—how should
credit for a reward be apportioned to the actions leading up to it? Q learning is,
to date, the most successful technique for temporal credit assignment, although
a related method, called the bucket brigade algorithm, has been proposed by
[Holland, 1986].
Learning problems similar to that faced by the agent in our grid world have
been thoroughly studied by Sutton who has proposed an architecture, called
DYNA, for solving them [Sutton, 1990]. DYNA combines reinforcement learning
with planning. Sutton characterizes planning as learning in a simulated world
that models the world that the agent inhabits. The agent’s model of the world
is obtained by Q learning in its actual world, and planning is accomplished by
Q learning in its model of the world.
We should note that the learning problem faced by our gridworld robot
152 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
could be modiﬁed to have several places in the grid that give positive rewards.
This possibility presents an interesting way to generalize the classical notion of
a “goal” in AI planning systems—even in those that do no learning. Instead of
representing a goal as a condition to be achieved, we represent a “goal struc
ture” as a set of rewards to be given for achieving various conditions. Then,
the generalized “goal” becomes maximizing discounted future reward instead of
simply achieving some particular condition. This generalization can be made to
encompass socalled goals of maintenance and goals of avoidance. The exam
ple presented above included avoiding bumping into the gridworld boundary.
A goal of maintenance, of a particular state, could be expressed in terms of a
reward that was earned whenever the agent was in that state and performed an
action that transitioned back to that state in one step.
11.5.2 Using Random Actions
When the next pattern presentation in a sequence of patterns is the one caused
by the agent’s own action in response to the last pattern, we have what is called
an online learning method. In Watkins and Dayan’s terminology, in online
learning the episodes form a continous sequence. As already mentioned, the
convergence theorem for Q learning does not require online learning; indeed,
special precautions must be taken to ensure that online learning meets the
conditions of the theorem. If online learning discovers some good paths to
rewards, the agent may ﬁxate on these and never discover a policy that leads
to a possibly greater longterm reward. In reinforcement learning phraseology,
this problem is referred to as the problem of exploitation (of already learned
behavior) versus exploration (of possibly better behavior).
One way to force exploration is to perform occasional random actions (in
stead of that single action prescribed by the current Q values). For example,
in the gridworld problem, one could imagine selecting an action randomly ac
cording to a probability distribution over the actions (n, e, s, and w). This
distribution, in turn, could depend on the Q values. For example, we might
ﬁrst ﬁnd that action prescribed by the Q values and then choose that action
with probability 1/2, choose the two orthogonal actions with probability 3/16
each, and choose the opposite action with probability 1/8. This policy might be
modiﬁed by “simulated annealing” which would gradually increase the probabil
ity of the action prescribed by the Q values more and more as time goes on. This
strategy would favor exploration at the beginning of learning and exploitation
later.
Other methods, also, have been proposed for dealing with exploration, in
cluding making unvisited states intrinsically rewarding and using an “interval
estimate,” which is related to the uncertainty in the estimate of a state’s value
[Kaelbling, 1993].
11.5. DISCUSSION, LIMITATIONS, ANDEXTENSIONS OF QLEARNING153
11.5.3 Generalizing Over Inputs
For large problems it would be impractical to maintain a table like that used
in our gridworld example. Various researchers have suggested mechanisms for
computing Q values, given pattern inputs and actions. One method that sug
gests itself is to use a neural network. For example, consider the simple linear
machine shown in Fig. 11.4.
X
. . .
. . .
Y
Y
Y
trainable weights
Y
W
i
R dot product units
Q(a
i
, X) = X
.
W
i
Q(a
1
, X)
Q(a
2
, X)
Q(a
R
, X)
Figure 11.4: A Net that Computes Q Values
Such a neural net could be used by an agent that has R actions to select
from. The Q values (as a function of the input pattern X and the action a
i
) are
computed as dot products of weight vectors (one for each action) and the input
vector. Weight adjustments are made according to a TD(0) procedure to bring
the Q value for the action last selected closer to the sum of the immediate reward
(if any) and the (discounted) maximum Q value for the next input pattern.
If the optimum Q values for the problem (whatever they might be) are more
complex than those that can be computed by a linear machine, a layered neural
network might be used. Sigmoid units in the ﬁnal layer would compute Q values
in the range 0 to 1. The TD(0) method for updating Q values would then have to
be combined with a multilayer weightchanging rule, such as backpropagation.
Networks of this sort are able to aggregate diﬀerent input vectors into regions
for which the same action should be performed. This kind of aggregation is an
example of what has been called structural credit assignment. Combining TD(λ)
and backpropagation is a method for dealing with both the temporal and the
structural credit assignment problems.
154 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
Interesting examples of delayedreinforcement training of simulated and
actual robots requiring structural credit assignment have been reported by
[Lin, 1992, Mahadevan & Connell, 1992].
11.5.4 Partially Observable States
So far, we have identiﬁed the input vector, X, with the actual state of the envi
ronment. When the input vector results from an agent’s perceptual apparatus
(as we assume it does), there is no reason to suppose that it uniquely identiﬁes
the environmental state. Because of inevitable perceptual limitations, several
diﬀerent environmental states might give rise to the same input vector. This
phenomenon has been referred to as perceptual aliasing. With perceptual alias
ing, we can no longer guarantee that Q learning will result in even useful action
policies, let alone optimal ones. Several researchers have attempted to deal with
this problem using a variety of methods including attempting to model “hid
den” states by using internal memory [Lin, 1993]. That is, if some aspect of
the environment cannot be sensed currently, perhaps it was sensed once and
can be remembered by the agent. When such is the case, we no longer have a
Markov problem; that is, the next X vector, given any action, may depend on
a sequence of previous ones rather than just the immediately preceding one. It
might be possible to reinstate a Markov framework (over the X’s) if X includes
not only current sensory precepts but information from the agent’s memory.
11.5.5 Scaling Problems
Several diﬃculties have so far prohibited wide application of reinforcement learn
ing to large problems. (The TDgammon program, mentioned in the last chap
ter, is probably unique in terms of success on a highdimensional problem.)
We have already touched on some diﬃculties; these and others are summarized
below with references to attempts to overcome them.
a. Exploration versus exploitation.
• use random actions
• favor states not visited recently
• separate the learning phase from the use phase
• employ a teacher to guide exploration
b. Slow time to convergence
• combine learning with prior knowledge; use estimates of Q values
(rather than random values) initially
• use a hierarchy of actions; learn primitive actions ﬁrst and freeze the
useful sequences into macros and then learn how to use the macros
11.6. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 155
• employ a teacher; use graded “lessons”—starting near the rewards
and then backing away, and use examples of good behavior [Lin, 1992]
• use more eﬃcient computations; e.g. do several updates per episode
[Moore & Atkeson, 1993]
c. Large state spaces
• use handcoded features
• use neural networks
• use nearestneighbor methods [Moore, 1990]
d. Temporal discounting problems. Using small γ can make the learner too
greedy for present rewards and indiﬀerent to the future; but using large γ
slows down learning.
• use a learning method based on average rewards [Schwartz, 1993]
e. No “transfer” of learning . What is learned depends on the reward struc
ture; if the rewards change, learning has to start over.
• Separate the learning into two parts: learn an “action model” which
predicts how actions change states (and is constant over all prob
lems), and then learn the “values” of states by reinforcement learn
ing for each diﬀerent set of rewards. Sometimes the reinforcement
learning part can be replaced by a “planner” that uses the action
model to produce plans to achieve goals.
Also see other articles in the special issue on reinforcement learning: Machine
Learning, 8, May, 1992.
11.6 Bibliographical and Historical Remarks
To be added.
156 CHAPTER 11. DELAYEDREINFORCEMENT LEARNING
Chapter 12
ExplanationBased
Learning
12.1 Deductive Learning
In the learning methods studied so far, typically the training set does not ex
haust the version space. Using logical terminology, we could say that the classi
ﬁer’s output does not logically follow from the training set. In this sense, these
methods are inductive. In logic, a deductive system is one whose conclusions
logically follow from a set of input facts, if the system is sound.
1
To contrast inductive with deductive systems in a logical setting, suppose
we have a set of facts (the training set) that includes the following formulas:
{Round(Obj1), Round(Obj2), Round(Obj3), Round(Obj4),
Ball(Obj1), Ball(Obj2), Ball(Obj3), Ball(Obj4)}
A learning system that forms the conclusion (∀x)[Ball(x) ⊃ Round(x)] is in
ductive. This conclusion may be useful (if there are no facts of the form
Ball(σ) ∧ ¬Round(σ)), but it does not logically follow from the facts. On the
other hand, if we had the facts Green(Obj5) and Green(Obj5) ⊃ Round(Obj5),
then we could logically conclude Round(Obj5). Making this conclusion and sav
ing it is an instance of deductive learning—a topic we study in this chapter.
Suppose that some logical proposition, φ, logically follows from some set of
facts, ∆. Under what circumstances might we say that the process of deducing
φ from ∆ results in our learning φ? In a sense, we implicitly knew φ all along,
since it was inherent in knowing ∆. Yet, φ might not be obvious given ∆, and
1
Logical reasoning systems that are not sound, for example those using nonmonotonic
reasoning, themselves might produce inductive conclusions that do not logically follow from
the input facts.
157
158 CHAPTER 12. EXPLANATIONBASED LEARNING
the deduction process to establish φ might have been arduous. Rather than have
to deduce φ again, we might want to save it, perhaps along with its deduction,
in case it is needed later. Shouldn’t that process count as learning? Dietterich
[Dietterich, 1990] has called this type of learning speedup learning.
Strictly speaking, speedup learning does not result in a system being able to
make decisions that, in principle, could not have been made before the learning
took place. Speedup learning simply makes it possible to make those decisions
more eﬃciently. But, in practice, this type of learning might make possible
certain decisions that might otherwise have been infeasible.
To take an extreme case, a chess player can be said to learn chess even though
optimal play is inherent in the rules of chess. On the surface, there seems to be
no real diﬀerence between the experiencebased hypotheses that a chess player
makes about what constitutes good play and the kind of learning we have been
studying so far.
As another example, suppose we are given some theorems about geometry
and are asked to prove that the sum of the angles of a right triangle is 180
degrees. Let us further suppose that the proof we constructed did not depend
on the given triangle being a right triangle; in that case we can learn a more
general fact. The learning technique that we are going to study next is related
to this example. It is called explanationbased learning (EBL). EBL can be
thought of as a process in which implicit knowledge is converted into explicit
knowledge.
In EBL, we specialize parts of a domain theory to explain a particular ex
ample, then we generalize the explanation to produce another element of the
domain theory that will be useful on similar examples. This process is illustrated
in Fig. 12.1.
12.2 Domain Theories
Two types of information were present in the inductive methods we have studied:
the information inherent in the training samples and the information about the
domain that is implied by the “bias” (for example, the hypothesis set from which
we choose functions). The learning methods are successful only if the hypothesis
set is appropriate for the problem. Typically, the smaller the hypothesis set (that
is, the more a priori information we have about the function being sought), the
less dependent we are on information being supplied by a training set (that
is, fewer samples). A priori information about a problem can be expressed in
several ways. The methods we have studied so far restrict the hypotheses in a
rather direct way. A less direct method involves making assertions in a logical
language about the property we are trying to learn. A set of such assertions is
usually called a “domain theory.”
Suppose, for example, that we wanted to classify people according to whether
or not they were good credit risks. We might represent a person by a set of
properties (income, marital status, type of employment, etc.), assemble such
12.3. AN EXAMPLE 159
Domain
Theory
Example
(X is P)
Prove: X is P
specialize
Explanation
(Proof)
generalize
A New Domain Rule:
Things "like" X are P
Y is like X
Complex Proof
Process
Trivial Proof
Y is P
Figure 12.1: The EBL Process
data about people who are known to be good and bad credit risks and train a
classiﬁer to make decisions. Or, we might go to a loan oﬃcer of a bank, ask him
or her what sorts of things s/he looks for in making a decision about a loan,
encode this knowledge into a set of rules for an expert system, and then use
the expert system to make decisions. The knowledge used by the loan oﬃcer
might have originated as a set of “policies” (the domain theory), but perhaps the
application of these policies were specialized and made more eﬃcient through
experience with the special cases of loans made in his or her district.
12.3 An Example
To make our discussion more concrete, let’s consider the following fanciful exam
ple. We want to ﬁnd a way to classify robots as “robust” or not. The attributes
that we use to represent a robot might include some that are relevant to this
decision and some that are not.
160 CHAPTER 12. EXPLANATIONBASED LEARNING
Suppose we have a domain theory of logical sentences that taken together,
help to deﬁne whether or not a robot can be classiﬁed as robust. (The same
domain theory may be useful for several other purposes also, but among other
things, it describes the concept “robust.”)
In this example, let’s suppose that our domain theory includes the sentences:
Fixes(u, u) ⊃ Robust(u)
(An individual that can ﬁx itself is robust.)
Sees(x, y) ∧ Habile(x) ⊃ Fixes(x, y)
(A habile individual that can see another entity can ﬁx that entity.)
Robot(w) ⊃ Sees(w, w)
(All robots can see themselves.)
R2D2(x) ⊃ Habile(x)
(R2D2class individuals are habile.)
C3PO(x) ⊃ Habile(x)
(C3POclass individuals are habile.)
. . .
(By convention, variables are assumed to be universally quantiﬁed.) We could
use theoremproving methods operating on this domain theory to conclude
whether certain robots are robust. These methods might be computationally
quite expensive because extensive search may have to be performed to derive a
conclusion. But after having found a proof for some particular robot, we might
be able to derive some new sentence whose use allows a much faster conclusion.
We next show how such a new rule might be derived in this example. Suppose
we are given a number of facts about Num5, such as:
Robot(Num5)
R2D2(Num5)
Age(Num5, 5)
Manufacturer(Num5, GR)
. . .
12.3. AN EXAMPLE 161
Fixes(u, u) => Robust(u)
Robust(Num5)
Fixes(Num5, Num5)
Sees(Num5,Num5)
Habile(Num5)
Sees(x,y) & Habile(x)
=> Fixes(x,y)
Robot(w)
=> Sees(w,w)
Robot(Num5)
R2D2(x)
=> Habile(x)
R2D2(Num5)
Figure 12.2: A Proof Tree
We are also told that Robust(Num5) is true, but we nevertheless attempt to
ﬁnd a proof of that assertion using these facts about Num5 and the domain
theory. The facts about Num5 correspond to the features that we might use
to represent Num5. In this example, not all of them are relevant to a decision
about Robust(Num5). The relevant ones are those used or needed in proving
Robust(Num5) using the domain theory. The proof tree in Fig. 12.2 is one that
a typical theoremproving system might produce.
In the language of EBL, this proof is an explanation for the fact
Robust(Num5). We see from this explanation that the only facts about Num5
that were used were Robot(Num5) and R2D2(Num5). In fact, we could con
struct the following rule from this explanation:
Robot(Num5) ∧ R2D2(Num5) ⊃ Robust(Num5)
The explanation has allowed us to prune some attributes about Num5 that are
irrelevant (at least for deciding Robust(Num5)). This type of pruning is the ﬁrst
sense in which an explanation is used to generalize the classiﬁcation problem.
([DeJong & Mooney, 1986] call this aspect of explanationbased learning feature
elimination.) But the rule we extracted from the explanation applies only to
Num5. There might be little value in learning that rule since it is so speciﬁc.
Can it be generalized so that it can be applied to other individuals as well?
162 CHAPTER 12. EXPLANATIONBASED LEARNING
Examination of the proof shows that the same proof structure, using the
same sentences from the domain theory, could be used independently of whether
we are talking about Num5 or some other individual. We can generalize the
proof by a process that replaces constants in the tip nodes of the proof tree
with variables and works upward—using uniﬁcation to constrain the values of
variables as needed to obtain a proof.
In this example, we replace Robot(Num5) by Robot(r) and R2D2(Num5)
by R2D2(s) and redo the proof—using the explanation proof as a template.
Note that we use diﬀerent values for the two diﬀerent occurrences of Num5 at
the tip nodes. Doing so sometimes results in more general, but nevertheless
valid rules. We now apply the rules used in the proof in the forward direction,
keeping track of the substitutions imposed by the most general uniﬁers used in
the proof. (Note that we always substitute terms that are already in the tree for
variables in rules.) This process results in the generalized proof tree shown in
Fig. 12.3. Note that the occurrence of Sees(r, r) as a node in the tree forces the
uniﬁcation of x with y in the domain rule, Sees(x, y)∧Habile(y) ⊃ Fixes(x, y).
The substitutions are then applied to the variables in the tip nodes and the root
node to yield the general rule: Robot(r) ∧ R2D2(r) ⊃ Robust(r).
This rule is the end result of EBL for this example. The process
by which Num5 in this example was generalized to a variable is what
[DeJong & Mooney, 1986] call identity elimination (the precise identity of Num5
turned out to be irrelevant). (The generalization process described in this ex
ample is based on that of [DeJong & Mooney, 1986] and diﬀers from that of
[Mitchell, et al., 1986]. It is also similar to that used in [Fikes, et al., 1972].)
Clearly, under certain assumptions, this general rule is more easily used to con
clude Robust about an individual than the original proof process was.
It is important to note that we could have derived the general rule from the
domain theory without using the example. (In the literature, doing so is called
static analysis [Etzioni, 1991].) In fact, the example told us nothing new other
than what it told us about Num5. The sole role of the example in this instance
of EBL was to provide a template for a proof to help guide the generalization
process. Basing the generalization process on examples helps to insure that we
learn rules matched to the distribution of problems that occur.
There are a number of qualiﬁcations and elaborations about EBL that need
to be mentioned.
12.4 Evaluable Predicates
The domain theory includes a number of predicates other than the one occuring
in the formula we are trying to prove and other than those that might custom
arily be used to describe an individual. One might note, for example, that if we
used Habile(Num5) to describe Num5, the proof would have been shorter. Why
didn’t we? The situation is analogous to that of using a data base augmented
by logical rules. In the latter application, the formulas in the actual data base
12.4. EVALUABLE PREDICATES 163
Robust(r)
Fixes(r, r)
Sees(r,r) Habile(s)
Robot(r) R2D2(s)
{r/w}
{s/x}
{r/x, r/y, r/s}
{r/u}
Robot(w)
=> Sees(w,w)
R2D2(x)
=> Habile(x)
Sees(x,y) & Habile(x)
=> Fixes(x,y)
Fixes(u, u) => Robust(u)
becomes R2D2(r) after
applying {r/s}
Figure 12.3: A Generalized Proof Tree
are “extensional,” and those in the logical rules are “intensional.” This usage
reﬂects the fact that the predicates in the data base part are deﬁned by their
extension—we explicitly list all the tuples sastisfying a relation. The logical
rules serve to connect the data base predicates with higher level abstractions
that are described (if not deﬁned) by the rules. We typically cannot look up
the truth values of formulas containing these intensional predicates; they have
to be derived using the rules and the database.
The EBL process assumes something similar. The domain theory is useful
for connecting formulas that we might want to prove with those whose truth
values can be “looked up” or otherwise evaluated. In the EBL literature, such
formulas satisfy what is called the operationality criterion. Perhaps another
analogy might be to neural networks. The evaluable predicates correspond to
the components of the input pattern vector; the predicates in the domain theory
correspond to the hidden units. Finding the new rule corresponds to ﬁnding a
simpler expression for the formula to be proved in terms only of the evaluable
predicates.
164 CHAPTER 12. EXPLANATIONBASED LEARNING
12.5 More General Proofs
Examining the domain theory for our example reveals that an alternative rule
might have been: Robot(u) ∧ C3PO(u) ⊃ Robust(u). Such a rule might
have resulted if we were given {C3PO(Num6), Robot(Num6), . . .} and proved
Robust(Num6). After considering these two examples (Num5 and Num6),
the question arises, do we want to generalize the two rules to something like:
Robot(u)∧[C3PO(u)∨R2D2(u)] ⊃ Robust(u)? Doing so is an example of what
[DeJong & Mooney, 1986] call structural generalization (via disjunctive augmen
tation ).
Adding disjunctions for every alternative proof can soon become cumbersome
and destroy any eﬃciency advantage of EBL. In our example, the eﬃciency
might be retrieved if there were another evaluable predicate, say, Bionic(u) such
that the domain theory also contained R2D2(x) ⊃ Bionic(x) and C3PO(x) ⊃
Bionic(x). After seeing a number of similar examples, we might be willing to
induce the formula Bionic(u) ⊃ [C3PO(u) ∨ R2D2(u)] in which case the rule
with the disjunction could be replaced with Robot(u) ∧Bionic(u) ⊃ Robust(u).
12.6 Utility of EBL
It is well known in theorem proving that the complexity of ﬁnding a proof
depends both on the number of formulas in the domain theory and on the depth
of the shortest proof. Adding a new rule decreases the depth of the shortest
proof but it also increases the number of formulas in the domain theory. In
realistic applications, the added rules will be relevant for some tasks and not for
others. Thus, it is unclear whether the overall utility of the new rules will turn
out to be positive. EBL methods have been applied in several settings, usually
with positive utility. (See [Minton, 1990] for an analysis).
12.7 Applications
There have been several applications of EBL methods. We mention two here,
namely the formation of macrooperators in automatic plan generation and
learning how to control search.
12.7.1 MacroOperators in Planning
In automatic planning systems, eﬃciency can sometimes be enhanced by chain
ing together a sequence of operators into macrooperators. We show an exam
ple of a process for creating macrooperators based on techniques explored by
[Fikes, et al., 1972].
Referring to Fig. 12.4, consider the problem of ﬁnding a plan for a robot in
room R1 to fetch a box, B1, by going to an adjacent room, R2, and pushing it
12.7. APPLICATIONS 165
back to R1. The goal for the robot is INROOM(B1, R1), and the facts that
are true in the initial state are listed in the ﬁgure.
R1
R2
R3
D1
D2
B1
Initial State:
INROOM(ROBOT, R1)
INROOM(B1,R2)
CONNECTS(D1,R1,R2)
CONNECTS(D1,R2,R1)
. . .
Figure 12.4: Initial State of a Robot Problem
We will construct the plan from a set of STRIPS operators that include:
GOTHRU(d, r1, r2)
Preconditions: INROOM(ROBOT, r1), CONNECTS(d, r1, r2)
Delete list: INROOM(ROBOT, r1)
Add list: INROOM(ROBOT, r2)
PUSHTHRU(b, d, r1, r2)
Preconditions: INROOM(ROBOT, r1), CONNECTS(d, r1, r2), INROOM(b, r1)
Delete list: INROOM(ROBOT, r1), INROOM(b, r1)
Add list: INROOM(ROBOT, r2), INROOM(b, r2)
A backwardreasoning STRIPS system might produce the plan shown in
Fig. 12.5. We show there the main goal and the subgoals along a solution path.
(The conditions in each subgoal that are true in the initial state are shown
underlined.) The preconditions for this plan, true in the initial state, are:
INROOM(ROBOT, R1)
166 CHAPTER 12. EXPLANATIONBASED LEARNING
CONNECTS(D1, R1, R2)
CONNECTS(D1, R2, R1)
INROOM(B1, R2)
Saving this speciﬁc plan, valid only for the speciﬁc constants it mentions, would
not be as useful as would be saving a more general one. We ﬁrst generalize
these preconditions by substituting variables for constants. We then follow the
structure of the speciﬁc plan to produce the generalized plan shown in Fig. 12.6
that achieves INROOM(b1, r4). Note that the generalized plan does not require
pushing the box back to the place where the robot started. The preconditions
for the generalized plan are:
INROOM(ROBOT, r1)
CONNECTS(d1, r1, r2)
CONNECTS(d2, r2, r4)
INROOM(b, r4)
INROOM(B1,R1)
PUSHTHRU(B1,d,r1,R1)
INROOM(ROBOT, r1),
CONNECTS(d, r1, R1),
INROOM(B1, r1)
INROOM(ROBOT, R2),
CONNECTS(D1, R2, R1),
INROOM(B1, R2)
{R2/r1,
D1/d}
GOTHRU(d2, r3, R2)
INROOM(ROBOT, r3),
CONNECTS(d2, r3, R2),
CONNECTS(D1, R2, R1),
INROOM(B1, R2)
{R1/r3, D1/d2}
INROOM(ROBOT, R1),
CONNECTS(D1, R1, R2),
CONNECTS(D1, R2, R1),
INROOM(B1, R2)
R1
R2
R3
D1
D2
GOTHRU(D1,R1,R2)
PUSHTHRU(B1,D1,R2,R1)
B1
PLAN:
Figure 12.5: A Plan for the Robot Problem
Another related technique that chains together sequences of operators to
form more general ones is the chunking mechanism in Soar [Laird, et al., 1986].
12.7. APPLICATIONS 167
INROOM(b1,r4)
PUSHTHRU(b1,d2,r2,r4)
INROOM(ROBOT, r2),
CONNECTS(d1, r1, r2),
CONNECTS(d2, r2, r4),
INROOM(b1, r4)
GOTHRU(d1, r1, r2)
INROOM(ROBOT, r1),
CONNECTS(d1, r1, r2),
CONNECTS(d2, r2, r4),
INROOM(b1, r4)
Figure 12.6: A Generalized Plan
12.7.2 Learning Search Control Knowledge
Besides their use in creating macrooperators, EBL methods can be used to
improve the eﬃciency of planning in another way also. In his system called
PRODIGY, Minton proposed using EBL to learn eﬀective ways to control
search [Minton, 1988]. PRODIGY is a STRIPSlike system that solves planning
problems in the blocksworld, in a simple mobile robot world, and in jobshop
scheduling. PRODIGY has a domain theory involving both the domain of the
problem and a simple (meta) theory about planning. Its meta theory includes
statements about whether a control choice about a subgoal to work on, an oper
ator to apply, etc. either succeeds or fails. After producing a plan, it analyzes its
successful and its unsuccessful choices and attempts to explain them in terms
of its domain theory. Using an EBLlike process, it is able to produce useful
control rules such as:
168 CHAPTER 12. EXPLANATIONBASED LEARNING
IF (AND (CURRENT −NODE node)
(CANDIDATE −GOAL node (ON x y))
(CANDIDATE −GOAL node (ON y z)))
THEN (PREFER GOAL (ON y z) TO (ON x y))
PRODIGY keeps statistics on how often these learned rules are used, their
savings (in time to ﬁnd plans), and their cost of application. It saves only the
rules whose utility, thus measured, is judged to be high. Minton [Minton, 1990]
has shown that there is an overall advantage of using these rules (as against not
having any rules and as against handcoded search control rules).
12.8 Bibliographical and Historical Remarks
To be added.
Bibliography
[Acorn & Walden, 1992] Acorn, T., and Walden, S., “SMART: Support Man
agement Automated Reasoning Technology for COMPAQ Customer Ser
vice,” Proc. Fourth Annual Conf. on Innovative Applications of Artiﬁcial
Intelligence, Menlo Park, CA: AAAI Press, 1992.
[Aha, 1991] Aha, D., Kibler, D., and Albert, M., “InstanceBased Learning
Algorithms,” Machine Learning, 6, 3766, 1991.
[Anderson & Bower, 1973] Anderson, J. R., and Bower, G. H., Human Asso
ciative Memory, Hillsdale, NJ: Erlbaum, 1973.
[Anderson, 1958] Anderson, T. W., An Introduction to Multivariate Statistical
Analysis, New York: John Wiley, 1958.
[Barto, Bradtke, & Singh, 1994] Barto, A., Bradtke, S., and Singh, S., “Learn
ing to Act Using RealTime Dynamic Programming,” to appear in Ar
tiﬁcial Intelligence, 1994.
[Baum & Haussler, 1989] Baum, E, and Haussler, D., “What Size Net Gives
Valid Generalization?” Neural Computation, 1, pp. 151160, 1989.
[Baum, 1994] Baum, E., “When Are kNearest Neighbor and Backpropagation
Accurate for FeasibleSized Sets of Examples?” in Hanson, S., Drastal,
G., and Rivest, R., (eds.), Computational Learning Theory and Natural
Learning Systems, Volume 1: Constraints and Prospects, pp. 415442,
Cambridge, MA: MIT Press, 1994.
[Bellman, 1957] Bellman, R. E., Dynamic Programming, Princeton: Princeton
University Press, 1957.
[Blumer, et al., 1987] Blumer, A., et al., “Occam’s Razor,” Info. Process. Lett.,
vol 24, pp. 37780, 1987.
[Blumer, et al., 1990] Blumer, A., et al., “Learnability and the Vapnik
Chervonenkis Dimension,” JACM, 1990.
[Bollinger & Duﬃe, 1988] Bollinger, J., and Duﬃe, N., Computer Control of
Machines and Processes, Reading, MA: AddisonWesley, 1988.
169
170 BIBLIOGRAPHY
[Brain, et al., 1962] Brain, A. E., et al., “Graphical Data Processing Research
Study and Experimental Investigation,” Report No. 8 (pp. 913) and No.
9 (pp. 310), Contract DA 36039 SC78343, SRI International, Menlo
Park, CA, June 1962 and September 1962.
[Breiman, et al., 1984] Breiman, L., Friedman, J., Olshen, R., and Stone, C.,
Classiﬁcation and Regression Trees, Monterey, CA: Wadsworth, 1984.
[Brent, 1990] Brent, R. P., “Fast Training Algorithms for MultiLayer Neural
Nets,” Numerical Analysis Project Manuscript NA9003, Computer Sci
ence Department, Stanford University, Stanford, CA 94305, March 1990.
[Bryson & Ho 1969] Bryson, A., and Ho, Y.C., Applied Optimal Control, New
York: Blaisdell.
[Buchanan & Wilkins, 1993] Buchanan, B. and Wilkins, D., (eds.), Readings in
Knowledge Acquisition and Learning, San Francisco: Morgan Kaufmann,
1993.
[Carbonell, 1983] Carbonell, J., “Learning by Analogy,” in Machine Learning:
An Artiﬁcial Intelligence Approach, Michalski, R., Carbonell, J., and
Mitchell, T., (eds.), San Francisco: Morgan Kaufmann, 1983.
[Cheeseman, et al., 1988] Cheeseman, P., et al., “AutoClass: A Bayesian Clas
siﬁcation System,” Proc. Fifth Intl. Workshop on Machine Learning,
Morgan Kaufmann, San Mateo, CA, 1988. Reprinted in Shavlik, J. and
Dietterich, T., Readings in Machine Learning, Morgan Kaufmann, San
Francisco, pp. 296306, 1990.
[Cover & Hart, 1967] Cover, T., and Hart, P., “Nearest Neighbor Pattern Clas
siﬁcation,” IEEE Trans. on Information Theory, 13, 2127, 1967.
[Cover, 1965] Cover, T., “Geometrical and Statistical Properties of Systems
of Linear Inequalities with Applications in Pattern Recognition,” IEEE
Trans. Elec. Comp., EC14, 326334, June, 1965.
[Dasarathy, 1991] Dasarathy, B. V., Nearest Neighbor Pattern Classiﬁcation
Techniques, IEEE Computer Society Press, 1991.
[Dayan & Sejnowski, 1994] Dayan, P., and Sejnowski, T., “TD(λ) Converges
with Probability 1,” Machine Learning, 14, pp. 295301, 1994.
[Dayan, 1992] Dayan, P., “The Convergence of TD(λ) for General λ,” Machine
Learning, 8, 341362, 1992.
[DeJong & Mooney, 1986] DeJong, G., and Mooney, R., “ExplanationBased
Learning: An Alternative View,” Machine Learning, 1:145176, 1986.
Reprinted in Shavlik, J. and Dietterich, T., Readings in Machine Learn
ing, San Francisco: Morgan Kaufmann, 1990, pp 452467.
BIBLIOGRAPHY 171
[Dietterich & Bakiri, 1991] Dietterich, T. G., and Bakiri, G., “ErrorCorrecting
Output Codes: A General Method for Improving Multiclass Induc
tive Learning Programs,” Proc. Ninth Nat. Conf. on A.I., pp. 572577,
AAAI91, MIT Press, 1991.
[Dietterich, et al., 1990] Dietterich, T., Hild, H., and Bakiri, G., “A Compara
tive Study of ID3 and Backpropagation for English TexttoSpeech Map
ping,” Proc. Seventh Intl. Conf. Mach. Learning, Porter, B. and Mooney,
R. (eds.), pp. 2431, San Francisco: Morgan Kaufmann, 1990.
[Dietterich, 1990] Dietterich, T., “Machine Learning,” Annu. Rev. Comput.
Sci., 4:255306, Palo Alto: Annual Reviews Inc., 1990.
[Duda & Fossum, 1966] Duda, R. O., and Fossum, H., “Pattern Classiﬁcation
by Iteratively Determined Linear and Piecewise Linear Discriminant
Functions,” IEEE Trans. on Elect. Computers, vol. EC15, pp. 220232,
April, 1966.
[Duda & Hart, 1973] Duda, R. O., and Hart, P.E., Pattern Classiﬁcation and
Scene Analysis, New York: Wiley, 1973.
[Duda, 1966] Duda, R. O., “Training a Linear Machine on Mislabeled Patterns,”
SRI Tech. Report prepared for ONR under Contract 3438(00), SRI In
ternational, Menlo Park, CA, April 1966.
[Efron, 1982] Efron, B., The Jackknife, the Bootstrap and Other Resampling
Plans, Philadelphia: SIAM, 1982.
[Ehrenfeucht, et al., 1988] Ehrenfeucht, A., et al., “A General Lower Bound on
the Number of Examples Needed for Learning,” in Proc. 1988 Workshop
on Computational Learning Theory, pp. 110120, San Francisco: Morgan
Kaufmann, 1988.
[Etzioni, 1991] Etzioni, O., “STATIC: A ProblemSpace Compiler for
PRODIGY,” Proc. of Ninth National Conf. on Artiﬁcial Intelligence,
pp. 533540, Menlo Park: AAAI Press, 1991.
[Etzioni, 1993] Etzioni, O., “A Structural Theory of ExplanationBased Learn
ing,” Artiﬁcial Intelligence, 60:1, pp. 93139, March, 1993.
[Evans & Fisher, 1992] Evans, B., and Fisher, D., Process Delay Analyses Using
DecisionTree Induction, Tech. Report CS9206, Department of Com
puter Science, Vanderbilt University, TN, 1992.
[Fahlman & Lebiere, 1990] Fahlman, S., and Lebiere, C., “The Cascade
Correlation Learning Architecture,” in Touretzky, D., (ed.), Advances in
Neural Information Processing Systems, 2, pp. 524532, San Francisco:
Morgan Kaufmann, 1990.
172 BIBLIOGRAPHY
[Fayyad, et al., 1993] Fayyad, U. M., Weir, N., and Djorgovski, S., “SKICAT:
A Machine Learning System for Automated Cataloging of Large Scale
Sky Surveys,” in Proc. Tenth Intl. Conf. on Machine Learning, pp. 112
119, San Francisco: Morgan Kaufmann, 1993. (For a longer version of
this paper see: Fayyad, U. Djorgovski, G., and Weir, N., “Automating
the Analysis and Cataloging of Sky Surveys,” in Fayyad, U., et al.(eds.),
Advances in Knowledge Discovery and Data Mining, Chapter 19, pp.
471ﬀ., Cambridge: The MIT Press, March, 1996.)
[Feigenbaum, 1961] Feigenbaum, E. A., “The Simulation of Verbal Learning Be
havior,” Proceedings of the Western Joint Computer Conference, 19:121
132, 1961.
[Fikes, et al., 1972] Fikes, R., Hart, P., and Nilsson, N., “Learning and Execut
ing Generalized Robot Plans,” Artiﬁcial Intelligence, pp 251288, 1972.
Reprinted in Shavlik, J. and Dietterich, T., Readings in Machine Learn
ing, San Francisco: Morgan Kaufmann, 1990, pp 468486.
[Fisher, 1987] Fisher, D., “Knowledge Acquisition via Incremental Conceptual
Clustering,” Machine Learning, 2:139172, 1987. Reprinted in Shavlik,
J. and Dietterich, T., Readings in Machine Learning, San Francisco:
Morgan Kaufmann, 1990, pp. 267–283.
[Friedman, et al., 1977] Friedman, J. H., Bentley, J. L., and Finkel, R. A., “An
Algorithm for Finding Best Matches in Logarithmic Expected Time,”
ACM Trans. on Math. Software, 3(3):209226, September 1977.
[Fu, 1994] Fu, L., Neural Networks in Artiﬁcial Intelligence, New York:
McGrawHill, 1994.
[Gallant, 1986] Gallant, S. I., “Optimal Linear Discriminants,” in Eighth Inter
national Conf. on Pattern Recognition, pp. 849852, New York: IEEE,
1986.
[Genesereth & Nilsson, 1987] Genesereth, M., and Nilsson, N., Logical Founda
tions of Artiﬁcial Intelligence, San Francisco: Morgan Kaufmann, 1987.
[Gluck & Rumelhart, 1989] Gluck, M. and Rumelhart, D., Neuroscience and
Connectionist Theory, The Developments in Connectionist Theory, Hills
dale, NJ: Erlbaum Associates, 1989.
[Hammerstrom, 1993] Hammerstrom, D., “Neural Networks at Work,” IEEE
Spectrum, pp. 2632, June 1993.
[Haussler, 1988] Haussler, D., “Quantifying Inductive Bias: AI Learning Al
gorithms and Valiant’s Learning Framework,” Artiﬁcial Intelligence,
36:177221, 1988. Reprinted in Shavlik, J. and Dietterich, T., Readings in
Machine Learning, San Francisco: Morgan Kaufmann, 1990, pp. 96107.
BIBLIOGRAPHY 173
[Haussler, 1990] Haussler, D., “Probably Approximately Correct Learning,”
Proc. Eighth Nat. Conf. on AI, pp. 11011108. Cambridge, MA: MIT
Press, 1990.
[Hebb, 1949] Hebb, D. O., The Organization of Behaviour, New York: John
Wiley, 1949.
[Hertz, Krogh, & Palmer, 1991] Hertz, J., Krogh, A, and Palmer, R., Introduc
tion to the Theory of Neural Computation, Lecture Notes, vol. 1, Santa
Fe Inst. Studies in the Sciences of Complexity, New York: Addison
Wesley, 1991.
[Hirsh, 1994] Hirsh, H., “Generalizing Version Spaces,” Machine Learning, 17,
545, 1994.
[Holland, 1975] Holland, J., Adaptation in Natural and Artiﬁcial Systems, Ann
Arbor: The University of Michigan Press, 1975. (Second edition printed
in 1992 by MIT Press, Cambridge, MA.)
[Holland, 1986] Holland, J. H., “Escaping Brittleness; The Possibilities of
GeneralPurpose Learning Algorithms Applied to Parallel RuleBased
Systems.” In Michalski, R., Carbonell, J., and Mitchell, T. (eds.) , Ma
chine Learning: An Artiﬁcial Intelligence Approach, Volume 2, chapter
20, San Francisco: Morgan Kaufmann, 1986.
[Hunt, Marin, & Stone, 1966] Hunt, E., Marin, J., and Stone, P., Experiments
in Induction, New York: Academic Press, 1966.
[Jabbour, K., et al., 1987] Jabbour, K., et al., “ALFA: Automated Load Fore
casting Assistant,” Proc. of the IEEE Pwer Engineering Society Summer
Meeting, San Francisco, CA, 1987.
[John, 1995] John, G., “Robust Linear Discriminant Trees,” Proc. of the Conf.
on Artiﬁcial Intelligence and Statistics, Ft. Lauderdale, FL, January,
1995.
[Kaelbling, 1993] Kaelbling, L. P., Learning in Embedded Systems, Cambridge,
MA: MIT Press, 1993.
[Kohavi, 1994] Kohavi, R., “BottomUp Induction of Oblivious ReadOnce De
cision Graphs,” Proc. of European Conference on Machine Learning
(ECML94), 1994.
[Kolodner, 1993] Kolodner, J., CaseBased Reasoning, San Francisco: Morgan
Kaufmann, 1993.
[Koza, 1992] Koza, J., Genetic Programming: On the Programming of Comput
ers by Means of Natural Selection, Cambridge, MA: MIT Press, 1992.
[Koza, 1994] Koza, J., Genetic Programming II: Automatic Discovery of
Reusable Programs, Cambridge, MA: MIT Press, 1994.
174 BIBLIOGRAPHY
[Laird, et al., 1986] Laird, J., Rosenbloom, P., and Newell, A., “Chunking in
Soar: The Anatomy of a General Learning Mechanism,” Machine Learn
ing, 1, pp. 1146, 1986. Reprinted in Buchanan, B. and Wilkins, D.,
(eds.), Readings in Knowledge Acquisition and Learning, pp. 518535,
Morgan Kaufmann, San Francisco, CA, 1993.
[Langley, 1992] Langley, P., “Areas of Application for Machine Learning,” Proc.
of Fifth Int’l. Symp. on Knowledge Engineering, Sevilla, 1992.
[Langley, 1996] Langley, P., Elements of Machine Learning, San Francisco:
Morgan Kaufmann, 1996.
[Lavraˇc & Dˇzeroski, 1994] Lavraˇc, N., and Dˇzeroski, S., Inductive Logic Pro
gramming, Chichester, England: Ellis Horwood, 1994.
[Lin, 1992] Lin, L., “SelfImproving Reactive Agents Based on Reinforcement
Learning, Planning, and Teaching,” Machine Learning, 8, 293321, 1992.
[Lin, 1993] Lin, L., “Scaling Up Reinforcement Learning for Robot Control,”
Proc. Tenth Intl. Conf. on Machine Learning, pp. 182189, San Francisco:
Morgan Kaufmann, 1993.
[Littlestone, 1988] Littlestone, N., “Learning Quickly When Irrelevant At
tributes Abound: A New LinearThreshold Algorithm,” Machine Learn
ing 2: 285318, 1988.
[Maass & Tur´an, 1994] Maass, W., and Tur´an, G., “How Fast Can a Thresh
old Gate Learn?,” in Hanson, S., Drastal, G., and Rivest, R., (eds.),
Computational Learning Theory and Natural Learning Systems, Volume
1: Constraints and Prospects, pp. 381414, Cambridge, MA: MIT Press,
1994.
[Mahadevan & Connell, 1992] Mahadevan, S., and Connell, J., “Automatic
Programming of BehaviorBased Robots Using Reinforcement Learn
ing,” Artiﬁcial Intelligence, 55, pp. 311365, 1992.
[Marchand & Golea, 1993] Marchand, M., and Golea, M., “On Learning Sim
ple Neural Concepts: From Halfspace Intersections to Neural Decision
Lists,” Network, 4:6785, 1993.
[McCulloch & Pitts, 1943] McCulloch, W. S., and Pitts, W. H., “A Logical Cal
culus of the Ideas Immanent in Nervous Activity,” Bulletin of Mathe
matical Biophysics, Vol. 5, pp. 115133, Chicago: University of Chicago
Press, 1943.
[Michie, 1992] Michie, D., “Some Directions in Machine Intelligence,” unpub
lished manuscript, The Turing Institute, Glasgow, Scotland, 1992.
[Minton, 1988] Minton, S., Learning Search Control Knowledge: An
ExplanationBased Approach, Kluwer Academic Publishers, Boston,
MA, 1988.
BIBLIOGRAPHY 175
[Minton, 1990] Minton, S., “Quantitative Results Concerning the Utility of
ExplanationBased Learning,” Artiﬁcial Intelligence, 42, pp. 363392,
1990. Reprinted in Shavlik, J. and Dietterich, T., Readings in Machine
Learning, San Francisco: Morgan Kaufmann, 1990, pp. 573587.
[Mitchell, et al., 1986] Mitchell, T., et al., “ExplanationBased Generalization:
A Unifying View,” Machine Learning, 1:1, 1986. Reprinted in Shavlik,
J. and Dietterich, T., Readings in Machine Learning, San Francisco:
Morgan Kaufmann, 1990, pp. 435451.
[Mitchell, 1982] Mitchell, T., “Generalization as Search,” Artiﬁcial Intelligence,
18:203226, 1982. Reprinted in Shavlik, J. and Dietterich, T., Readings in
Machine Learning, San Francisco: Morgan Kaufmann, 1990, pp. 96–107.
[Moore & Atkeson, 1993] Moore, A., and Atkeson, C., “Prioritized Sweeping:
Reinforcement Learning with Less Data and Less Time,” Machine Learn
ing, 13, pp. 103130, 1993.
[Moore, et al., 1994] Moore, A. W., Hill, D. J., and Johnson, M. P., “An Em
pirical Investigation of Brute Force to Choose Features, Smoothers, and
Function Approximators,” in Hanson, S., Judd, S., and Petsche, T.,
(eds.), Computational Learning Theory and Natural Learning Systems,
Vol. 3, Cambridge: MIT Press, 1994.
[Moore, 1990] Moore, A., Eﬃcient Memorybased Learning for Robot Control,
PhD. Thesis; Technical Report No. 209, Computer Laboratory, Univer
sity of Cambridge, October, 1990.
[Moore, 1992] Moore, A., “Fast, Robust Adaptive Control by Learning Only
Forward Models,” in Moody, J., Hanson, S., and Lippman, R., (eds.),
Advances in Neural Information Processing Systems 4, San Francisco:
Morgan Kaufmann, 1992.
[Mueller & Page, 1988] Mueller, R. and Page, R., Symbolic Computing with
Lisp and Prolog, New York: John Wiley & Sons, 1988.
[Muggleton, 1991] Muggleton, S., “Inductive Logic Programming,” New Gen
eration Computing, 8, pp. 295318, 1991.
[Muggleton, 1992] Muggleton, S., Inductive Logic Programming, London: Aca
demic Press, 1992.
[Muroga, 1971] Muroga, S., Threshold Logic and its Applications, New York:
Wiley, 1971.
[Natarjan, 1991] Natarajan, B., Machine Learning: A Theoretical Approach,
San Francisco: Morgan Kaufmann, 1991.
176 BIBLIOGRAPHY
[Nilsson, 1965] Nilsson, N. J., “Theoretical and Experimental Investigations in
Trainable PatternClassifying Systems,” Tech. Report No. RADCTR
65257, Final Report on Contract AF30(602)3448, Rome Air Develop
ment Center (Now Rome Laboratories), Griﬃss Air Force Base, New
York, September, 1965.
[Nilsson, 1990] Nilsson, N. J., The Mathematical Foundations of Learning Ma
chines, San Francisco: Morgan Kaufmann, 1990. (This book is a reprint
of Learning Machines: Foundations of Trainable PatternClassifying
Systems, New York: McGrawHill, 1965.)
[Oliver, Dowe, & Wallace, 1992] Oliver, J., Dowe, D., and Wallace, C., “Infer
ring Decision Graphs using the Minimum Message Length Principle,”
Proc. 1992 Australian Artiﬁcial Intelligence Conference, 1992.
[Pagallo & Haussler, 1990] Pagallo, G. and Haussler, D., “Boolean Feature Dis
covery in Empirical Learning,” Machine Learning, vol.5, no.1, pp. 7199,
March 1990.
[Pazzani & Kibler, 1992] Pazzani, M., and Kibler, D., “The Utility of Knowl
edge in Inductive Learning,” Machine Learning, 9, 5794, 1992.
[Peterson, 1961] Peterson, W., Error Correcting Codes, New York: John Wiley,
1961.
[Pomerleau, 1991] Pomerleau, D., “Rapidly Adapting Artiﬁcial Neural Net
works for Autonomous Navigation,” in Lippmann, P., et al. (eds.), Ad
vances in Neural Information Processing Systems, 3, pp. 429435, San
Francisco: Morgan Kaufmann, 1991.
[Pomerleau, 1993] Pomerleau, D, Neural Network Perception for Mobile Robot
Guidance, Boston: Kluwer Academic Publishers, 1993.
[Quinlan & Rivest, 1989] Quinlan, J. Ross, and Rivest, Ron, “Inferring Deci
sion Trees Using the Minimum Description Length Principle,” Informa
tion and Computation, 80:227–248, March, 1989.
[Quinlan, 1986] Quinlan, J. Ross, “Induction of Decision Trees,” Machine
Learning, 1:81–106, 1986. Reprinted in Shavlik, J. and Dietterich, T.,
Readings in Machine Learning, San Francisco: Morgan Kaufmann, 1990,
pp. 57–69.
[Quinlan, 1987] Quinlan, J. R., “Generating Production Rules from Decision
Trees,” In IJCAI87: Proceedings of the Tenth Intl. Joint Conf. on Ar
tiﬁcial Intelligence, pp. 3047, San Francisco: MorganKaufmann, 1987.
[Quinlan, 1990] Quinlan, J. R., “Learning Logical Deﬁnitions from Relations,”
Machine Learning, 5, 239266, 1990.
BIBLIOGRAPHY 177
[Quinlan, 1993] Quinlan, J. Ross, C4.5: Programs for Machine Learning, San
Francisco: Morgan Kaufmann, 1993.
[Quinlan, 1994] Quinlan, J. R., “Comparing Connectionist and Symbolic Learn
ing Methods,” in Hanson, S., Drastal, G., and Rivest, R., (eds.), Com
putational Learning Theory and Natural Learning Systems, Volume 1:
Constraints and Prospects, pp. 445456,, Cambridge, MA: MIT Press,
1994.
[Ridgway, 1962] Ridgway, W. C., An Adaptive Logic System with Generalizing
Properties, PhD thesis, Tech. Rep. 15561, Stanford Electronics Labs.,
Stanford, CA, April 1962.
[Rissanen, 1978] Rissanen, J., “Modeling by Shortest Data Description,” Auto
matica, 14:465471, 1978.
[Rivest, 1987] Rivest, R. L., “Learning Decision Lists,” Machine Learning, 2,
229246, 1987.
[Rosenblatt, 1958] Rosenblatt, F., Principles of Neurodynamics, Washington:
Spartan Books, 1961.
[Ross, 1983] Ross, S., Introduction to Stochastic Dynamic Programming, New
York: Academic Press, 1983.
[Rumelhart, Hinton, & Williams, 1986] Rumelhart, D. E., Hinton, G. E., and
Williams, R. J., “Learning Internal Representations by Error Propa
gation,” In Rumelhart, D. E., and McClelland, J. L., (eds.) Parallel
Distributed Processing, Vol 1, 318–362, 1986.
[Russell & Norvig 1995] Russell, S., and Norvig, P., Artiﬁcial Intelligence: A
Modern Approach, Englewood Cliﬀs, NJ: Prentice Hall, 1995.
[Samuel, 1959] Samuel, A., “Some Studies in Machine Learning Using the Game
of Checkers,” IBM Journal of Research and Development, 3:211229, July
1959.
[Schwartz, 1993] Schwartz, A., “A Reinforcement Learning Method for Max
imizing Undiscounted Rewards,” Proc. Tenth Intl. Conf. on Machine
Learning, pp. 298305, San Francisco: Morgan Kaufmann, 1993.
[Sejnowski, Koch, & Churchland, 1988] Sejnowski, T., Koch, C., and Church
land, P., “Computational Neuroscience,” Science, 241: 12991306, 1988.
[Shavlik, Mooney, & Towell, 1991] Shavlik, J., Mooney, R., and Towell, G.,
“Symbolic and Neural Learning Algorithms: An Experimental Compar
ison,” Machine Learning, 6, pp. 111143, 1991.
[Shavlik & Dietterich, 1990] Shavlik, J. and Dietterich, T., Readings in Ma
chine Learning, San Francisco: Morgan Kaufmann, 1990.
178 BIBLIOGRAPHY
[Sutton & Barto, 1987] Sutton, R. S., and Barto, A. G., “A Temporal
Diﬀerence Model of Classical Conditioning,” in Proceedings of the Ninth
Annual Conference of the Cognitive Science Society, Hillsdale, NJ: Erl
baum, 1987.
[Sutton, 1988] Sutton, R. S., “Learning to Predict by the Methods of Temporal
Diﬀerences,” Machine Learning 3: 944, 1988.
[Sutton, 1990] Sutton, R., “Integrated Architectures for Learning, Planning,
and Reacting Based on Approximating Dynamic Programming,” Proc. of
the Seventh Intl. Conf. on Machine Learning, pp. 216224, San Francisco:
Morgan Kaufmann, 1990.
[Taylor, Michie, & Spiegalhalter, 1994] Taylor, C., Michie, D., and Spiegal
halter, D., Machine Learning, Neural and Statistical Classiﬁcation,
Paramount Publishing International.
[Tesauro, 1992] Tesauro, G., “Practical Issues in Temporal Diﬀerence Learn
ing,” Machine Learning, 8, nos. 3/4, pp. 257277, 1992.
[Towell & Shavlik, 1992] Towell G., and Shavlik, J., “Interpretation of Artiﬁ
cial Neural Networks: Mapping KnowledgeBased Neural Networks into
Rules,” in Moody, J., Hanson, S., and Lippmann, R., (eds.), Advances in
Neural Information Processing Systems, 4, pp. 977984, San Francisco:
Morgan Kaufmann, 1992.
[Towell, Shavlik, & Noordweier, 1990] Towell, G., Shavlik, J., and Noordweier,
M., “Reﬁnement of Approximate Domain Theories by KnowledgeBased
Artiﬁcial Neural Networks,” Proc. Eighth Natl., Conf. on Artiﬁcial In
telligence, pp. 861866, 1990.
[Unger, 1989] Unger, S., The Essence of Logic Circuits, Englewood Cliﬀs, NJ:
PrenticeHall, 1989.
[Utgoﬀ, 1989] Utgoﬀ, P., “Incremental Induction of Decision Trees,” Machine
Learning, 4:161–186, Nov., 1989.
[Valiant, 1984] Valiant, L., “A Theory of the Learnable,” Communications of
the ACM, Vol. 27, pp. 11341142, 1984.
[Vapnik & Chervonenkis, 1971] Vapnik, V., and Chervonenkis, A., “On the
Uniform Convergence of Relative Frequencies, Theory of Probability and
its Applications, Vol. 16, No. 2, pp. 264280, 1971.
[Various Editors, 19891994] Advances in Neural Information Processing Sys
tems, vols 1 through 6, San Francisco: Morgan Kaufmann, 1989 1994.
[Watkins & Dayan, 1992] Watkins, C. J. C. H., and Dayan, P., “Technical Note:
QLearning,” Machine Learning, 8, 279292, 1992.
BIBLIOGRAPHY 179
[Watkins, 1989] Watkins, C. J. C. H., Learning From Delayed Rewards, PhD
Thesis, University of Cambridge, England, 1989.
[Weiss & Kulikowski, 1991] Weiss, S., and Kulikowski, C., Computer Systems
that Learn, San Francisco: Morgan Kaufmann, 1991.
[Werbos, 1974] Werbos, P., Beyond Regression: New Tools for Prediction and
Analysis in the Behavioral Sciences, Ph.D. Thesis, Harvard University,
1974.
[Widrow & Lehr, 1990] Widrow, B., and Lehr, M. A., “30 Years of Adaptive
Neural Networks: Perceptron, Madaline and Backpropagation,” Proc.
IEEE, vol. 78, no. 9, pp. 14151442, September, 1990.
[Widrow & Stearns, 1985] Widrow, B., and Stearns, S., Adaptive Signal Pro
cessing, Englewood Cliﬀs, NJ: PrenticeHall.
[Widrow, 1962] Widrow, B., “Generalization and Storage in Networks of Ada
line Neurons,” in Yovits, Jacobi, and Goldstein (eds.), Selforganizing
Systems—1962, pp. 435461, Washington, DC: Spartan Books, 1962.
[Winder, 1961] Winder, R., “Single Stage Threshold Logic,” Proc. of the AIEE
Symp. on Switching Circuits and Logical Design, Conf. paper CP60
1261, pp. 321332, 1961.
[Winder, 1962] Winder, R., Threshold Logic, PhD Dissertation, Princeton Uni
versity, Princeton, NJ, 1962.
[Wnek, et al., 1990] Wnek, J., et al., “Comparing Learning Paradigms via Di
agrammatic Visualization,” in Proc. Fifth Intl. Symp. on Methodologies
for Intelligent Systems, pp. 428437, 1990. (Also Tech. Report MLI902,
University of Illinois at UrbanaChampaign.)
ii
Contents
1 Preliminaries 1.1 Introduction . . . . . . . . . . . . . . . . 1.1.1 What is Machine Learning? . . . 1.1.2 Wellsprings of Machine Learning 1.1.3 Varieties of Machine Learning . . 1.2 Learning InputOutput Functions . . . . 1.2.1 Types of Learning . . . . . . . . 1.2.2 Input Vectors . . . . . . . . . . . 1.2.3 Outputs . . . . . . . . . . . . . . 1.2.4 Training Regimes . . . . . . . . . 1.2.5 Noise . . . . . . . . . . . . . . . 1.2.6 Performance Evaluation . . . . . 1.3 Learning Requires Bias . . . . . . . . . . 1.4 Sample Applications . . . . . . . . . . . 1.5 Sources . . . . . . . . . . . . . . . . . . 1.6 Bibliographical and Historical Remarks 2 Boolean Functions 2.1 Representation . . . . . . . . . . . . . . 2.1.1 Boolean Algebra . . . . . . . . . 2.1.2 Diagrammatic Representations . 2.2 Classes of Boolean Functions . . . . . . 2.2.1 Terms and Clauses . . . . . . . . 2.2.2 DNF Functions . . . . . . . . . . 2.2.3 CNF Functions . . . . . . . . . . 2.2.4 Decision Lists . . . . . . . . . . . 2.2.5 Symmetric and Voting Functions 2.2.6 Linearly Separable Functions . . 2.3 Summary . . . . . . . . . . . . . . . . . 2.4 Bibliographical and Historical Remarks iii 1 1 1 3 4 5 5 7 8 8 9 9 9 11 13 13 15 15 15 16 17 17 18 21 22 23 23 24 25
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . .5 Synergies Between Neural Network and KnowledgeBased Methods 4. . . . . . .2 Special Cases of Linearly Separable Functions . . . . . . . . . . . . . . .4. . . . 5. . . . . .3 Networks of TLUs . . . . . . . . . . . . .4 Weight Space . . . . . . . . . . . . . 4. . . . . . . . . . . . . . . . . . 4. . . .4. . . 4. . . . . . 4. . . . .4. . . . 4. . . . . . .3 NearestNeighbor Methods . . . . . . . . . . . . . . . . . . 4.3. . . . . . . . . . . . . . . . . . . . . . . .3 Conditionally Independent Binary Components 5. . . . . . .3 Piecewise Linear Machines . . . . .3 Computing Weight Changes in the Final Layer . . .2 Version Graphs . . . . . . . . . . . .1 Motivation and Examples . .2 The Backpropagation Method . . . . . . . . . . . . . . . .1 Version Spaces and Mistake Bounds . . . .1 Threshold Logic Units . . . . .3. . .1. .1. . . .5 Bibliographical and Historical Remarks . . . . .6 Training a TLU on NonLinearlySeparable Training Sets 4. . . . . . .2 Linear Machines . . .1 Using Statistical Decision Theory . . . . . . .1. . . . . . . . . . . . . . . .4 Computing Changes to the Weights in Intermediate Layers 4. . . . . .4. . . . . . . . .5 Variations on Backprop . . . . . . . . . . . . . . . . . . . 4. . .1 Deﬁnitions and Geometry . . . . . . . . . . 4. . . . . . . . .6 An Application: Steering a Van . iv . .3. . . . . . . . . . . 5. . . . .4 Cascade Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1. 4. 4. . . . . .1. . 3. .6 Bibliographical and Historical Remarks .1 Notation . . 3. . . . . 4. . . .3 ErrorCorrection Training of a TLU . . . . . . . . . . . . . . . . . . . . . . 4. . . . . . . . . . . . . . .1. . . . 5 Statistical Learning 5.2 Gaussian (or Normal) Distributions . . .3 Using Version Spaces for Learning 3.1. . . . . .4 Bibliographical and Historical Remarks . . .1. . . . . . . . . . . . . . . 3. .4. . 4. . 4. . .3 Learning as Search of a Version Space . . . . . . . 4. . . . 5. . . . . . . 4. . . . . . . . . . . . . . . . . . . 3. . .3. . . . . . . . . . . . . . 5. 4. . . . . . . . . . . . . . . . . . . . 4. . . . . . .1 Background and General Method . 5. .1. . . . . . . 27 27 29 32 32 34 35 35 35 37 38 40 42 44 44 46 46 49 50 51 52 52 53 56 58 59 60 61 61 63 63 63 65 68 70 70 72 4 Neural Networks 4. . . .4. . . . . . . . . .4 The Candidate Elimination Method . . . . . . . .2 Learning Belief Networks .4 Training Feedforward Networks by Backpropagation . . . . . . . .2 Madalines .5 The WidrowHoﬀ Procedure . . . . . . . .
. . . . .4. . . . 8. . . . . . . . .3 The VapnikChervonenkis Dimension . . .5 Bibliographical and Historical Remarks . . . . . . . . . . . . .7 Comparisons . . 6. . . .1 Notation and Assumptions for PAC Learning Theory . .7 Bibliographical and Historical Remarks . .3 An Example . . . . . 6. . . .1 Deﬁnitions . .4. . . . .1 Notation and Deﬁnitions . . . . . 6. . . . . 7 Inductive Logic Programming 7. . .3. . . . . . . . . . . . . . .4 VC Dimension and PAC Learning . . . . . . . . . . 100 . . . . . . . . . 101 . . . . 94 .2. . . . . . . . . . . .2 Supervised Learning of Univariate Decision Trees .3 Avoiding Overﬁtting in Decision Trees . . .8 Bibliographical and Historical Remarks . . .2 Capacity . . . . . . . . . .1 Overﬁtting . 6. .6 The Problem of Missing Attributes . . 7. . . . . . . . . . . . . . . . . . . . . . . . .1 Linear Dichotomies . . . 89 . . . . . . . . . . . . 73 73 74 75 75 79 79 80 80 81 82 83 84 84 86 86 87 . 104 107 107 109 109 111 112 113 113 115 116 117 118 118 8 Computational Learning Theory 8. . . .2. . .2. . .5 The Problem of Replicated Subtrees . . . . . . . .2. . . . . 7. .3 Some Properly PACLearnable Classes . . . . . 8. . . . . . . . . . . . . . 7. . . . . . 6. . . . . . . . . . . . . . . . . . . . . . . . . . . .3. 8. . . . . . . 6. . . . . . . . 6. v . . . .3. . . . .4. . . 8. . . . . . . . . . . .4 Inducing Recursive Programs . . . . . . . . . 8. . . .2. . . . . . . . . .2 PAC Learning . . . . . . .2 Validation Methods . . . . . . . . . . .3 NonBinary Attributes . . 90 . . . . . . . . . .4 Some Facts and Speculations About the VC Dimension 8. . 6. . . . . . . 7. . 8. . . . . . .3. . . 6. . . . . . . 6. . . . . . . . . . 6. . . . . . . . . . . . . . . . . . . . . . . . . 98 . . . . .3 Networks Equivalent to Decision Trees . . . . .4. . . . . .6 Relationships Between ILP and Decision 7. . . . . . . . . . . . . . .2 Examples . . . . . . . . . . .2 A Generic ILP Algorithm . . . . . . 8. . . . .1 The Fundamental Theorem . . . . . . . . . 91 . . . . . . . . . . . 6. . .2. . . . . . . . . . . . . .6 Decision Trees 6. . . . . . . . . . . .5 Noise in Data . . . . . 8. . . . . . . .5 Choosing Literals to Add . . . . . . . . . . . . .3 A More General Capacity Result . .4 MinimumDescription Length Methods . . . . .4. . . . . . . . . . . 7. . . . . . . . . . 8. . . 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 Using Uncertainty Reduction to Select Tests 6. . . . . . . . . . . . . . Induction . . . . . . . . . . . . Tree . . 6. . . . . . . . . . . . .1 Selecting the Type of Test .4 Overﬁtting and Evaluation . . . . . . 6. . . . . . . . . . . . . . . . .
. . . . . . .3 Hierarchical Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . .5. .5 Scaling Problems . 135 10. . . . . . . . . 150 11. . . . . . . . . . . . . .5. .1 9. . . . . . . .5. . 144 11. . . . 125 A Method Based on Probabilities . . . . . . 150 11. . . . . .3. . . .6 Bibliographical and Historical Remarks . . 152 11. . .5 Theoretical Results . . . . . . . . . . . . . . . . . . . . .1 9. . . . 145 11. . 134 10. . . . . . .2. . . . . . . . . . . . . . . . . . . . .2 An Example . . . . . . . . . . . . . . 154 11. . . . . . . . . . . . . . . . . .3 Temporal Discounting and Optimal Policies . . . .6 IntraSequence Weight Updating .2. . . . . . . . . . . .3 Incremental Computation of the (∆W)i . . .4 An Experiment with TD Methods . . .1 Temporal Patterns and Prediction Problems .2 A Method Based on Euclidean Distance . . . 138 10. .5. . . . 138 10.4 Bibliographical and Historical Remarks . . . 141 11 DelayedReinforcement Learning 143 11. . . . . . . 147 11. . . . . . . . .1 The General Problem . . . . . . . .7 An Example Application: TDgammon . . . . . . . . . . . . . . . . . . . Limitations. 155 vi . . . . . . 124 9. . . . . . . . . . . . . . . . . . 120 A Method Based on Probabilities . 120 9.3.9 Unsupervised Learning 9. 131 10. . . . . . . . . . . . . . . . 125 9. . . . . . . . . . . . . . . . . . .2 A Method Based on Euclidean Distance . 143 11.5 Discussion. . .8 Bibliographical and Historical Remarks . . . 140 10. . . . . . . . .3 Generalizing Over Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 Using Random Actions . . . . . . . .2 119 What is Unsupervised Learning? . . . . . .2 Supervised and TemporalDiﬀerence Methods . . . . . . . . . . . . 154 11. . . . . . . . . . . . . . . . . . . . 131 10. . . . . . .1 9. . . . . . . . . . . . .1 An Illustrative Example .4 QLearning . . . . . and Extensions of QLearning . . . . 126 9. . . . . . . . . 130 131 10 TemporalDiﬀerence Learning 10. . .5. . . . . . . . . . . . . . . . . . . . . 153 11. . . . . . 119 Clustering Methods . . .4 Partially Observable States . . . . . . . . .
. . . 162 12. . . . . . .7 Applications .2 Domain Theories . . . . . . . . . . . . . . . . .1 MacroOperators in Planning . . . . .1 Deductive Learning . . . . . . . . . . . . . . .12 ExplanationBased Learning 157 12. . . . . . . . . 157 12. . . . . . 164 12. . . . . . . . . 159 12. . . . .5 More General Proofs . . . . . 167 12. . . . . . . . . . . . . . 164 12. . . . . . .6 Utility of EBL . . . . . . . . . . . . . . . . . . . . . . .2 Learning Search Control Knowledge . . .3 An Example . . .7. . . . . . 168 vii . . .7. . . . . . . . . . . . . . . . . . . . . . . . 164 12. . .8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 Evaluable Predicates . . . . . . . . . . . . 158 12. . . . 164 12. . . . .
viii .
and Bayes networks . . I do not give proofs of many of the theorems that I state. and suggestions from students and other readers. the book is not a handbook of machine learning practice.Preface These notes are in the process of becoming a textbook. Elman nets and other recurrent nets. Many typographical infelicities will no doubt persist until the ﬁnal version. but I do give plausibility arguments and citations to formal proofs.. as have my colleague. some undoubtedly remain—caveat lector. Instead. Ron Kohavi. criticisms. my goal is to give the reader suﬃcient preparation to make the extensive literature on machine learning accessible. Some of my plans for additions and other reminders are mentioned in marginal notes. . ix . Pat Langley. genetic algorithms. I am also collecting exercises and project suggestions which will appear in future versions. Karl Pﬂeger. The book concentrates on the important ideas in machine learning. And. My intention is to pursue a middle ground between a theoretical textbook and one that focusses on applications. More material has yet to be added. and Lise Getoor. and my teaching assistants. grammar and automata learning. Robert Allen. Please let me have your suggestions about topics that are too important to be left out. I do not treat many matters that would be of practical importance in applications. radial basis functions. Students in my Stanford courses on machine learning have already made several useful suggestions. and the author solicits corrections. Although I have tried to eliminate errors. I hope that future versions will cover Hopﬁeld nets. The process is quite unﬁnished.
planning. program. Some of these changes. To be slightly more speciﬁc. such as the addition of a record to a data base.1 1. The “changes” might be either enhancements to already performing systems or ab initio synthesis of new systems. Such tasks involve recognition. very broadly. or experience. by study. Machine learning usually refers to the changes in systems that perform tasks associated with artiﬁcial intelligence (AI). like intelligence. A dictionary deﬁnition includes phrases such as “to gain knowledge. diagnosis. or skill in. In this book we focus on learning in machines. or data (based on its inputs or in response to external information) in such a manner that its expected future performance improves. It seems likely also that the concepts and techniques being explored by researchers in machine learning may illuminate certain aspects of biological learning. we might say.” and “modiﬁcation of a behavioral tendency by experience. There are several parallels between animal and machine learning.1 Introduction What is Machine Learning? Learning. Certainly. for example. prediction. we show the architecture of a typical AI 1 . when the performance of a speechrecognition machine improves after hearing several samples of a person’s speech. But. we feel quite justiﬁed in that case to say that the machine has learned.1.Chapter 1 Preliminaries 1.” Zoologists and psychologists study learning in animals and humans. instruction. robot control. As regards machines. that a machine learns whenever it changes its structure. etc. fall comfortably within the province of other disciplines and are not necessarily better understood for being called learning. or understanding of. many techniques in machine learning derive from the eﬀorts of psychologists to make more precise their theories of animal and human learning through computational models. covers such a broad range of processes that it is difﬁcult to deﬁne precisely.
We would like machines to be able to adjust their internal structure to produce correct outputs for a large number of sample inputs and thus suitably constrain their input/output function to approximate the relationship implicit in the examples. Diﬀerent learning mechanisms might be employed depending on which subsystem is being changed. 1. This agent perceives and models its environment and computes appropriate actions. Some of these are: • Some tasks cannot be deﬁned well except by example. PRELIMINARIES “agent” in Fig. Of course. we might be able to specify input/output pairs but not a concise relationship between inputs and desired outputs. • It is possible that hidden among large piles of data are important relationships and correlations.2 CHAPTER 1. Sensory signals Goals Perception Model Planning and Reasoning Action Computation Actions Figure 1. perhaps by anticipating their eﬀects.1. . Machine learning methods can often be used to extract these relationships (data mining). Changes made to any of the components shown in the ﬁgure might count as learning. But there are important engineering reasons as well. we have already mentioned that the achievement of learning in machines might help us understand how animals and humans learn.1: An AI System One might ask “Why should machines have to learn? Why not design machines to perform as desired in the ﬁrst place?” There are several reasons why machine learning is important. that is. We will study several diﬀerent learning methods in this book.
1943. 1. Statistical methods for dealing with these problems can be considered instances of machine learning because the decision and estimation rules depend on a corpus of samples drawn from the problem environment. more recently by [Gluck & Rumelhart. A related problem is how to estimate the value of an unknown function at a new point given the values of this function at a set of sample points. Hebb. & Churchland. • New knowledge about tasks is constantly being discovered by humans.1. 1988]. Rosenblatt. In fact. • Brain Models: Nonlinear elements with weighted inputs have been suggested as simple models of biological neurons. 1958].1.2 Wellsprings of Machine Learning Work in machine learning is now converging from several sources. Vocabulary changes. These different traditions each bring diﬀerent methods and diﬀerent vocabulary which are now being assimilated into a more uniﬁed discipline. Brain modelers are interested in how closely these networks approximate the learning phenomena of . Continuing redesign of AI systems to conform to new knowledge is impractical. • Environments change over time. Sejnowski. more details will follow in the the appropriate chapters: • Statistics: A longstanding problem in statistics is how best to use samples drawn from unknown probability distributions to help decide from which distribution some new sample is drawn. 1989. Machines that learn this knowledge gradually might be able to capture more of it than humans would want to write down.1. but machine learning methods might be able to track much of it. Machine learning methods can be used for onthejob improvement of existing machine designs. We will explore some of the statistical methods later in the book. There is a constant stream of new events in the world. Here is a brief listing of some of the separate disciplines that have contributed to machine learning. INTRODUCTION 3 • Human designers often produce machines that do not work as well as desired in the environments in which they are used. • The amount of knowledge available about certain tasks might be too large for explicit encoding by humans. certain characteristics of the working environment might not be completely known at design time. Koch. Details about the statistical theory underlying these methods can be found in statistical textbooks such as [Anderson. 1958] and. Networks of these elements have been studied by several researchers including [McCulloch & Pitts. Machines that can adapt to a changing environment would reduce the need for constant redesign. 1949.
AI research has been concerned with machine learning. Recent work has been directed at discovering rules for expert systems using decisiontree methods [Quinlan. 1988. 1992. Some of the work in reinforcement learning can be traced to eﬀorts to model how reward stimuli inﬂuence the learning of goalseeking behavior in animals [Sutton & Barto. Since the distinction between evolving and learning can be blurred in computer systems. Lavraˇ & Dˇeroski. Etzioni. 1986. 1994]. Minton. Work connectionism. et al. 1988]. . We shall techniques are based on neural networks. 1990] and inductive logic programming [Muggleton. Koza. 1986. Often. An early example is the EPAM network for storing and retrieving one member of a pair of words when given another [Feigenbaum. For an introduction see [Bollinger & Duﬃe. 1959]. 1993]. & Stone. Some aspects of controlling a robot based on sensory inputs represent instances of this sort of problem. • Evolutionary Models: In nature. c z Another theme has been saving and generalizing the results of problem solving using explanationbased learning [DeJong & Mooney. or subsymbolic processing. PRELIMINARIES see that several important machine learning networks of nonlinear elements—often called inspired by this school is sometimes called computation. not only do individual animals learn to perform better. AI researchers have also explored the role of analogies in learning [Carbonell. 1973] methods. 1991. Related work led to a number of early decision tree [Hunt. Genetic algorithms [Holland. 1975] and genetic programming [Koza. 1961].. Marin. 1994] are the most prominent computational techniques for evolution. and the control process must track these changes. More recent work of this sort has been inﬂuenced by activities in artiﬁcial intelligence which we will be presenting. brainstyle CHAPTER 1. but species evolve to be better ﬁt in their individual niches. 1993]. 1966] and semantic network [Anderson & Bower.4 living brains. Reinforcement learning is an important theme in machine learning research. 1983] and how future actions and decisions can be based on previous exemplary cases [Kolodner. • Adaptive Control Theory: Control theorists study the problem of controlling a process having unknown parameters which must be estimated during operation. • Psychological Models: Psychologists have studied the performance of humans in various learning tasks. techniques that model certain aspects of biological evolution have been proposed as learning methods to improve the performance of computer programs. • Artiﬁcial Intelligence: From the beginning. Samuel developed a prominent early program that learned parameters of a function for evaluating board positions in the game of checkers [Samuel. 1987]. the parameters change during operation. Laird.
1. h. we know (sometimes only approximately) the values of f for the m samples in the training set. Ξ. Ξ.1 Types of Learning There are two major settings in which we wish to learn a function. Sometimes we know that f also belongs to this class or to a subset of this class. and we turn to that matter ﬁrst. is selected from a class of functions H. Our hypothesis about the function to be learned is denoted by h. We select h based on a training set. We think of h as being implemented by a device that has X as input and h(X) as output. the change to the existing structure might be simply to make it more computationally eﬃcient rather than to increase the coverage of the situations it can handle. xn ) which has n components. In one. 1. . of m input vector examples. In the latter case. Both f and h themselves may be vectorvalued. Much of the terminology that we shall be using throughout the book is best introduced by discussing the problem of learning functions. . f . and the task of the learner is to guess what it is. x2 . h. 1. . 1. that closely agrees with f for the members of Ξ. We assume that if we can ﬁnd a hypothesis. xi .2. .2 to help deﬁne some of the terminology used in describing the problem of learning a function. . .3 Varieties of Machine Learning Orthogonal to the question of the historical source of any learning technique is the more important question of what is to be learned. Many important details depend on the nature of the assumptions made about all of these entities.2. LEARNING INPUTOUTPUT FUNCTIONS 5 1. called supervised learning. . We assume a priori that the hypothesized function. . Imagine that there is a function.2 Learning InputOutput Functions We use Fig. Both f and h are functions of a vectorvalued input X = (x1 . We will consider a variety of diﬀerent computational structures: • Functions • Logic programs and rule sets • Finitestate machines • Grammars • Problem solving systems We will present methods both for the synthesis of these structures from examples and for changing existing structures. In this book.1. then this hypothesis will be a good guess for f —especially if Ξ is large. . we take it that the thing to be learned is a computational structure of some sort.
Xm} x1 . 1. (We can still regard the problem as one of learning a function. h. the value of the function is the name of the subset to which an input vector belongs. . PRELIMINARIES Training Set: = {X1.. X2. In the other setting. In this case.) Unsupervised learning methods have application in taxonomic problems in which it is desired to invent ways to classify data into meaningful categories. is to partition the training set into subsets. The problem in this case. . f . . typically. x2 plane that ﬁts the points. From this deductive process. . ΞR . This parabolic function. A very simple example of speedup learning involves deduction processes. but we need not have required exact matches. .6 CHAPTER 1. we simply have a training set of vectors without function values for them. h. that produced the four samples.3.2: An InputOutput Function Curveﬁtting is a simple example of supervised learning of a function. h. of seconddegree functions. We want to ﬁt these four points with a function. . or to modify an existing one. We shall also describe methods that are intermediate between supervised and unsupervised learning. h = f at the four samples. Ξ1 . . at the four sample points shown by the solid circles in Fig. From the formulas A ⊃ B and B ⊃ C. we can create the formula A ⊃ C—a new formula but one that does not sanction any more con . This type of learning is sometimes called speedup learning. We show there a twodimensional parabolic surface above the x1 . . . H. xi . xn X= h(X) h h H Figure 1. f . . An interesting special case is that of changing an existing function into an equivalent one that is computationally more eﬃcient. . . Suppose we are given the values of a twodimensional function. . in some appropriate way. drawn from the set. . is our hypothesis about the function. Xi. we can deduce C if we are given A. termed unsupervised learning. We might either be trying to ﬁnd a new function.
1.3: A Surface that Fits Four Points clusions than those that could be derived from the formulas that we previously had. input variables. pattern vector. The components. information about a student might be represented by the values of the attributes class. higgins). than we could have done before. 1.2 Input Vectors Because machine learning methods derive from so many diﬀerent traditions. the input vector is called by a variety of names. feature vector. In all cases. . male. As an example illustrating categorical values. it is possible to represent the input in unordered form by listing the names of the attributes together with their values. They might be realvalued numbers. For example. attributes.2. We say that the latter methods involve inductive learning. Some of these are: input vector. we might have: (major: history. major. example. LEARNING INPUTOUTPUT FUNCTIONS 7 h sample fvalue 1500 0 1000 00 500 00 0 10 5 0 5 10 10 5 10 5 0 x2 x1 Figure 1. Of course. xi . As an example of an attributevalue representation. As opposed to deduction. The vector form assumes that the attributes are ordered and given implicitly by a form. and components. But with this new formula we can derive C more quickly. there are no correct inductions—only useful ones. medium. large}) or unordered (as in the example just given). given A. its terminology is rife with synonyms. Additionally. sex. The values of the components can be of three main types. discretevalued numbers. We can contrast speedup learning with methods that create genuinely new functions—ones that might give diﬀerent results after learning than they did before. and instance. adviser. sample. or categorical values.2. categorical values may be ordered (as in {small. and we will be using most of them in this book. history. mixtures of all these types of values are possible. A particular student would then be represented by a vector such as: (sophomore. sex: male. of the input vector are variously called features.
can be used to produce a hypothesized function. age: 19).2. adviser: higgins. 1. By contrast. and the output itself is called a label. a recognizer. h. we select one member at a time from the training set and use this instance alone to modify a current hypothesis. We will be using the vector form exclusively. 1. and so on. The input in that case is some suitable representation of the printed character. the entire training set is available and used all at once to compute the function. a training pattern having value 1 is called a positive instance. The selection method can be random (with replacement) or it can cycle through the training set iteratively.) Using the training set members as they become available is called an online method. the classiﬁer implements a Boolean function. In that case. and the function is called a concept. If the entire training set becomes available one member at a time. say. Online methods might be used. in which case the process embodying h is variously called a classiﬁer. Vectorvalued outputs are also possible with components being real numbers or categorical values. h. In the batch method. When the input is also Boolean. and the classiﬁer maps this input into one of. Then another member of the training set is selected. and the output is called an output value or estimate. An important special case is that of Boolean output values. Ξ. (Alternatively. in the incremental method. 64 categories. PRELIMINARIES class: sophomore. which can be regarded as a special case of either discrete numbers (1. A variation of this method uses the entire training set to modify a current hypothesis iteratively until an acceptable hypothesis is obtained. the output may be a categorical value.8 CHAPTER 1. in which case the process embodying the function.4 Training Regimes There are several ways in which the training set. Learning a Boolean function is sometimes called concept learning. An important specialization uses Boolean values.2. at any stage all training set members so far available could be used in a “batch” process. a class. then we might also use an incremental method—selecting and using training set members as they arrive. False). Alternatively. Classiﬁers have application in a number of recognition problems. for example in the recognition of handprinted characters. a category. or a decision.3 Outputs The output may be a real number. for example.0) or of categorical variables (True. and a training sample having value 0 is called a negative instance. We study the Boolean case in some detail because it allows us to make important general points in a simpliﬁed setting. when the . is called a function estimator. or a categorizer.
1. Why would a learning procedure happen to select the quadratic one shown in that ﬁgure? In order to make that selection we had at least to limit a priori the set of hypotheses to quadratic functions and then to insist that the one we chose passed through all four sample points. At any stage of the process. after being presented with one member of the training set and its value we can rule out precisely onehalf of the members of H—those Boolean functions that would misclassify this labeled sample. A hypothesized function is said to generalize when it guesses well on the testing set. The next set of sensory inputs will depend on which action was selected. LEARNING REQUIRES BIAS 9 next training instance is some function of the current hypothesis and the previous instance—as it would be when a classiﬁer is used to decide on a robot’s next action given its current set of sensory inputs. This kind of a priori information is called bias. In this case.5 Noise Sometimes the vectors in the training set are corrupted by noise. and we have no preference among those that ﬁt the samples in the training set. As we present more members of the training set.1. Suppose we had no bias. it is important to have methods to evaluate the result of learning. it would be inappropriate to insist that the hypothesized function agree precisely with the values of the samples in the training set. and useful learning without bias is impossible. but. We will discuss this matter in more detail later. there are an uncountable number of diﬀerent functions having values that agree with the four samples shown in Fig. in supervised learning the induced function is usually evaluated on a separate set of inputs and function values for them called the testing set . that is H is the set of all 22 Boolean functions. brieﬂy. Class noise randomly alters the value of the function.3 Learning Requires Bias Long before now the reader has undoubtedly asked why is learning a function possible at all? Certainly.3.2. 1. for example. Both meansquarederror and the total number of errors are common measures. the graph of the number of hypotheses not yet ruled out as a function of the number of diﬀerent patterns presented is as shown in Fig. We can gain more insight into the role of bias by considering the special case of learning a Boolean function of n dimensions.2. 1. 1. 1. attribute noise randomly alters the values of the components of the input vector. In either case.3.4. . There are two kinds of noise.6 Performance Evaluation Even though there is no correct answer in inductive learning. The remaining functions constitute what is called a “version space. There are 2n diﬀerent Boolean n inputs possible.” we’ll explore that concept in more detail later.
Only memorization is possible here. Let’s look at a speciﬁc example of how bias aids learning.10 CHAPTER 1. then there is only one function remaining in this hypothsis set that is consistent with the training set. We’ll examine that theory later.4: Hypotheses Remaining as a Function of Labeled Patterns Presented But suppose we limited H to some subset. Depending on the subset and on the order of presentation of training patterns.5. of functions not ruled out log2Hv 2n 2n j (generalization is not possible) 0 0 2n j = no.6. . A Boolean function can be represented by a hypercube each of whose vertices represents a diﬀerent input pattern. No generalization is possible in this case because the training patterns give no clue about the value of a pattern not yet seen. Certainly. 1. We show a 3dimensional version in Fig. If the hypothesis set consists of just the linearly separable functions—those for which the positive and negative instances can be separated by a linear surface. which is a trivial sort of learning. we can already pin down what the function must be—given the bias. in this case. of labeled patterns already seen Figure 1. In this case it is even possible that after seeing fewer than all 2n labeled samples. 1. most of them may have the same value for most of the patterns not yet seen! The theory of Probably Approximately Correct (PAC) learning makes this intuitive idea precise. There. a curve of hypotheses not yet ruled out might look something like the one shown in Fig. even if there is more than one hypothesis remaining. even though the training set does not contain all possible patterns. there might be only one hypothesis that agrees with the training set. PRELIMINARIES half of the remaining Boolean functions have value 1 and half have value 0 for any training pattern not yet seen. Hc . we show a training set of six sample patterns and have marked those having a value of 1 by a small square and those having a value of 0 by a small circle. of all Boolean functions. So. Hv = no.
In our example of Fig.6. The principle of Occam’s razor. (William of Occam. In preference bias. of labeled patterns already seen Figure 1. Nevertheless. Rule discovery using a variant of ID3 for a printing industry problem . is a type of preference bias.4 Sample Applications Our main emphasis in this book is on the concepts of machine learning—not on its applications.5: Hypotheses Remaining From a Restricted Subset Machine learning researchers have identiﬁed two main varieties of bias. For example. if these concepts were irrelevant to realworld problems they would probably not be of much interest. 1285?1349. 1. one selects that hypothesis that is minimal according to some ordering scheme over all hypotheses. absolute and preference. 1992] cites some of the following applications and others: a. SAMPLE APPLICATIONS 11 Hv = no. if we had some way of measuring the complexity of a hypothesis.” which means “entities should not be multiplied unnecessarily. we give a short summary of some areas in which machine learning techniques have been successfully applied. of functions not ruled out log2Hv 2n depends on order of presentation log2Hc 0 0 2n j = no. we might select the one that was simplest among those that performed satisfactorily on the training set. one restricts H to a deﬁnite subset of functions. As motivation. the restriction was to linearly separable Boolean functions.1. was an English philosopher who said: “non sunt multiplicanda entia praeter necessitatem. In absolute bias (also called restricted hypothesisspace bias). used in science to prefer simple explanations to more complex ones. [Langley.4.”) 1.
Sharp’s Japanese kanji character recognition system processes 200 characters per second with 99+% accuracy. K. Classiﬁcation of stars and galaxies [Fayyad. 1993] mentions: a. commodity trading. e. Planning and scheduling for a steel mill using ExpertEase. d.3%. Many applicationoriented papers are presented at the annual conferences on Neural Information Processing Systems. b. NeuroForecasting Centre’s (London Business School and University College London) trading strategy selection network earned an average annual proﬁt of 18% against a conventional system’s 12. bioengineering. image processing. et al. face recognition.12 CHAPTER 1. diagnosis. 1992]. . a marketed (ID3like) system [Michie. Electric power load forecasting using a knearestneighbor rule system [Jabbour.. 1992]...6: A Training Set That Completely Determines a Linearly Separable Function [Evans & Fisher. b. 1993]. Among these are papers on: speech recognition. optical character recognition. music composition. It recognizes 3000+ characters. 1992]. c. As additional examples. PRELIMINARIES x3 x2 x1 Figure 1. Automatic “help desk” assistant using a nearestneighbor system [Acorn & Walden. [Hammerstrom. et al. 19891994]. dolphin echo recognition. and various control applications [Various Editors. 1987].
1.5. SOURCES
13
c. Fujitsu’s (plus a partner’s) neural network for monitoring a continuous steel casting operation has been in successful operation since early 1990. In summary, it is rather easy nowadays to ﬁnd applications of machine learning techniques. This fact should come as no surprise inasmuch as many machine learning techniques can be viewed as extensions of well known statistical methods which have been successfully applied for many years.
1.5
Sources
Besides the rich literature in machine learning (a small part of which is referenced in the Bibliography), there are several textbooks that are worth mentioning [Hertz, Krogh, & Palmer, 1991, Weiss & Kulikowski, 1991, Natarjan, 1991, Fu, 1994, Langley, 1996]. [Shavlik & Dietterich, 1990, Buchanan & Wilkins, 1993] are edited volumes containing some of the most important papers. A survey paper by [Dietterich, 1990] gives a good overview of many important topics. There are also well established conferences and publications where papers are given and appear including: • The Annual Conferences on Advances in Neural Information Processing Systems • The Annual Workshops on Computational Learning Theory • The Annual International Workshops on Machine Learning • The Annual International Conferences on Genetic Algorithms (The Proceedings of the abovelisted four conferences are published by Morgan Kaufmann.) • The journal Machine Learning (published by Kluwer Academic Publishers). There is also much information, as well as programs and datasets, available over the Internet through the World Wide Web.
1.6
Bibliographical and Historical Remarks
To be added. Every chapter will contain a brief survey of the history of the material covered in that chapter.
14
CHAPTER 1. PRELIMINARIES
Chapter 2
Boolean Functions
2.1
2.1.1
Representation
Boolean Algebra
Many important ideas about learning of functions are most easily presented using the special case of Boolean functions. There are several important subclasses of Boolean functions that are used as hypothesis classes for function learning. Therefore, we digress in this chapter to present a review of Boolean functions and their properties. (For a more thorough treatment see, for example, [Unger, 1989].) A Boolean function, f (x1 , x2 , . . . , xn ) maps an ntuple of (0,1) values to {0, 1}. Boolean algebra is a convenient notation for representing Boolean functions. Boolean algebra uses the connectives ·, +, and . For example, the and function of two variables is written x1 · x2 . By convention, the connective, “·” is usually suppressed, and the and function is written x1 x2 . x1 x2 has value 1 if and only if both x1 and x2 have value 1; if either x1 or x2 has value 0, x1 x2 has value 0. The (inclusive) or function of two variables is written x1 + x2 . x1 + x2 has value 1 if and only if either or both of x1 or x2 has value 1; if both x1 and x2 have value 0, x1 + x2 has value 0. The complement or negation of a variable, x, is written x. x has value 1 if and only if x has value 0; if x has value 1, x has value 0. These deﬁnitions are compactly given by the following rules for Boolean algebra: 1 + 1 = 1, 1 + 0 = 1, 0 + 0 = 0, 1 · 1 = 1, 1 · 0 = 0, 0 · 0 = 0, and 1 = 0, 0 = 1. Sometimes the arguments and values of Boolean functions are expressed in terms of the constants T (True) and F (False) instead of 1 and 0, respectively. 15
and vertices having value 0 are labeled with a small circle. x2 x1x2 x2 x1 + x2 x1 and x2 x1x2 + x1x2 x3 or x1 x1x2x3 + x1x2x3 + x1x2x3 + x1x2x3 x1 xor (exclusive or) x1 x2 even parity function Figure 2. A 3dimensional cube has 23 = 8 vertices. Instead. Thus. 3 and each may be labeled in two diﬀerent ways.1 we show some 2. we have DeMorgan’s laws (which can be veriﬁed by using the above deﬁnitions): x1 x2 = x1 + x2 .1. For a function of n variables.2 Diagrammatic Representations We saw in the last chapter that a Boolean function could be represented by labeling the vertices of a cube. One consisting of either a single variable or its complement. x1 (x2 x3 ) = (x1 x2 )x3 .and 3dimensional examples. we would need an ndimensional hypercube. it is easy to see how many Boolean functions of n dimensions there are. 2. is called a literal. for example. A Boolean formula consisting of a single variable. such as x1 is called an atom. and both can be written simply as x1 x2 x3 . The operators · and + do not commute between themselves. BOOLEAN FUNCTIONS The connectives · and + are each commutative and associative.1: Representing Boolean Functions on Cubes Using the hypercube representations. 2. Vertices having value 1 are labeled with a small square.16 CHAPTER 2. such as x1 . In Fig. and x1 + x2 = x1 x2 . thus there are 2(2 ) = 256 . Similarly for +.
where the li are literals. [Wnek. otherwise it has value 0. x3.x4 00 01 11 10 00 1 0 1 0 01 0 1 0 1 11 1 0 1 0 10 0 1 0 1 x1. there are 22 Boolean functions of n variables. 2. (An even parity function is a Boolean function that has value 1 if there are an even number of its arguments that have value 1. The rows and columns are arranged in such a way that entries that are adjacent in the map correspond to vertices that are adjacent in the hypercube representation.2.2. One diagrammatic technique for dimensions slightly higher than 3 is the Karnaugh map. we limit the class of hypotheses.and 3dimensional cubes later to provide some intuition about the properties of certain Boolean functions.2: A Karnaugh Map 2. . we frequently use some of the common subclasses of those functions. We show an example of the 4dimensional even parity function in Fig. so we must be careful in using intuitions gained in low dimensions.. et al. it will be important to know about these subclasses. Such a form is called a conjunction of literals. Some example terms are x1 x7 and x1 x2 x4 . A term is any function written in the form l1 l2 · · · lk . we cannot visualize hypercubes (for n > 3). One basic subclass is called terms. A Karnaugh map is an array of values of a Boolean function in which the horizontal rows are indexed by the values of some of the variables and the vertical columns are indexed by the rest. Of course. CLASSES OF BOOLEAN FUNCTIONS n 17 diﬀerent Boolean functions of 3 variables. The size of a term is the number of literals it contains. (Strictly speaking. We will be using 2.) Note that all adjacent cells in the table correspond to inputs diﬀering in only one component.2.x2 Figure 2. respectively. In general. the class of conjunctions of literals is called the monomials. The examples are of sizes 2 and 3.2 2. In learning Boolean functions.2. 1990]. Also describe general logic diagrams. Therefore. and there are many surprising properties of higher dimensional spaces.1 Classes of Boolean Functions Terms and Clauses To use absolute bias in machine learning.
a term. Two of them isolate onedimensional edges. Such a form is called a disjunction of literals. and vice versa. and the third isolates a zerodimensional vertex. BOOLEAN FUNCTIONS and a conjunction of literals itself is called a term.2. where C(i. f .) Thus. 2. Probably I’ll put in a simple termlearning algorithm here—so we can get started on learning! Also for DNF functions and decision lists—as they are deﬁned in the next few pages.2 DNF Functions A Boolean function is said to be in disjunctive normal form (DNF) if it can be written as a disjunction of terms. respectively. j) = (i−j)!j! is the binomial coeﬃcient. The relationship between implicants and prime implicants can be geometrically illustrated using the cube representation for Boolean functions. but none cuts oﬀ any vertices having value 0. If f is a term. i) clauses of size k or less.18 CHAPTER 2. In psychological experiments. Note that each of the three planes in the ﬁgure “cuts oﬀ” a group of vertices having value 1. both x2 x3 and x1 x3 are prime implicants of f = x2 x3 +x1 x3 +x2 x1 x3 . conjunctions of literals seem easier for humans to learn than disjunctions of literals. We illustrate it in Fig. where the li are literals. There are 3n possible clauses and fewer than i=0 C(2n. Each term in a DNF expression for a function is called an implicant because it “implies” the function (if the term has value 1.) It is easy to show that there are exactly 3n possible terms of n variables. is a prime implicant of f if the term. t. Thus. These planes are pictorial devices used to isolate certain lower dimensional subfaces of the cube. A term. This distinction is a ﬁne one which we elect to blur here. Both expressions are in the class 3DNF. so does the function). Some examples in DNF are: f = x1 x2 +x2 x3 x4 and f = x1 x3 + x2 x3 + x1 x2 x3 . The function is written as the disjunction of the implicants—corresponding to the union of all the vertices cut oﬀ by all of the planes. terms and clauses are duals of each other. Each group of vertices on a subface corresponds to one of the implicants of the function. an implicant is prime if and only if its corresponding subface is the largest dimensional subface that includes all of its vertices and . Consider. and thus each implicant corresponds to a subface of some dimension. Some example clauses are x3 + x5 + x6 and x1 + x4 . but x2 x1 x3 is not. f . In general. then (by De Morgan’s laws) f is a clause.3. The examples above are 2term and 3term expressions. t. A clause is any function written in the form l1 + l2 + · · · + lk . t . if f has value 1 whenever t does. formed by taking any literal out of an implicant t is no longer an implicant of f . the function f = x2 x3 + x1 x3 + x2 x1 x3 . (The implicant cannot be “divided” by any term and remain an implicant. i) = i! O(nk ). A DNF expression is called a kterm DNF expression if it is a disjunction of k terms. A kdimensional subface corresponds to an (n − k)size implicant term. 2. for example. k The number of terms of size k or less is bounded from above by i=0 C(2n. is an implicant of a function. The size of a clause is the number of literals it k contains. Geometrically. it is in the class kDNF if the size of its largest term is k.
but that representation is not unique. 0. CLASSES OF BOOLEAN FUNCTIONS 19 no other vertices having value 0. 0 x2 x1 f = x2x3 + x1x3 + x2x1x3 = x2x3 + x1x3 x2x3 and x1x3 are prime implicants Figure 2.) The other two implicants are prime because their corresponding subfaces cannot be expanded without including vertices having value 0. 2. . 1. All Boolean functions can also be represented in DNF in which each term is a prime implicant. 0. Whereas there are 22 functions of n dimensions in DNF (since k any Boolean function can be written in DNF). we don’t even have to include this term in the function because the vertex cut oﬀ by the plane corresponding to x2 x1 x3 is already cut oﬀ by the plane corresponding to x2 x3 . there are just 2O(n ) functions in kDNF. we can use the consensus method to ﬁnd an expression for the function in which each term is a prime implicant. 1 1. x3 1. If we can express a function in DNF form. 1 1.2. Note that the term x2 x1 x3 is not a prime implicant of f . 0. The consensus method relies on two results: • Consensus: We may replace this section with one describing the QuineMcCluskey method instead. as shown in Fig. 1 0. (In this case.2.4.3: A Function and its Implicants Note that all Boolean functions can be represented in DNF—trivially by disjunctions of terms of size n where each term corresponds to one of the vertices n whose value is 1.
1 1. but there is not a unique representation Figure 2. 1 1. 0. f1 · f2 is called the consensus of xi · f1 and xi · f2 . Example: x1 x4 x5 subsumes x1 x4 x2 x5 . The terms x1 x2 and x1 x2 have no consensus since each term has more than one literal appearing complemented in the other. 0. 0. 1. 0 x2 x1 f = x2x3 + x1x3 + x1x2 = x1x2 + x1x3 All of the terms are prime implicants. Readers familiar with the resolution rule of inference will note that consensus is the dual of resolution. We say that f1 subsumes xi · f1 . 1 0. • Subsumption: xi · f1 + f1 = f1 where f1 is a term. BOOLEAN FUNCTIONS x3 1.4: NonUniqueness of Representation by Prime Implicants xi · f1 + xi · f2 = xi · f1 + xi · f2 + f1 · f2 where f1 and f2 are terms such that no literal appearing in f1 appears complemented in f2 .20 CHAPTER 2. Examples: x1 is the consensus of x1 x2 and x1 x2 .
CLASSES OF BOOLEAN FUNCTIONS 21 The consensus method for ﬁnding a set of prime implicants for a function. A Boolean function is said to be in CNF if it can be written as a conjunction of clauses. The ﬁnal form of the function in which all terms are prime implicants is: f = x1 x2 + x1 x3 + x1 x4 x5 . iterates the following operations on the terms of a DNF expression for f until no more such operations can be applied: a. the terms remaining in T are all prime implicants of f. Example: Let f = x1 x2 + x1 x2 x3 + x1 x2 x3 x4 x5 . Shaded boxes surrounding a term indicate that it was subsumed. c. eliminate any terms in T that are subsumed by other terms in T .2. T . We show a derivation of a set of prime implicants in the consensus tree of Fig. b. 2.2.5. of terms in the DNF expression of f. compute the consensus of a pair of terms in T and add the result to T . .3 CNF Functions Disjunctive normal form has a dual: conjunctive normal form (CNF). 2 x1x2 x1x2x3 4 x1 x2 x3 x4 x5 x1x3 1 3 x1x2x4x5 6 f = x1x2 + x x + x x x 1 3 1 4 5 5 x1x4x5 Figure 2. The circled numbers adjoining the terms indicate the order in which the consensus and subsumption operations were performed. f .2. initialize the process with the set.5: A Consensus Tree 2. Its terms are all of the nonsubsumed terms in the consensus tree. When this process halts.
1) (x1 x2 x3 . 1993]). Because CNF and DNF are duals. . There are 2O[n k log(n)] functions in kDL [Rivest. and vice versa. 1987]. . and x3 = 1.4 Decision Lists Rivest has proposed a class of Boolean functions called decision lists [Rivest. an application of De Morgan’s law renders f k in CNF.2. vq ) (tq−1 . 1) (1. . xn ). . because the last one does. BOOLEAN FUNCTIONS An example in CNF is: f = (x1 + x2 )(x2 + x3 + x4 ). The example is a 2clause expression in 3CNF. The value of a decision list is the value of vi for the ﬁrst ti in the list that has value 1. . and x3 = 1. It has value 1 for x1 = 1. x2 = 0. vi ) ··· (t2 . (At least one ti will have value 1. A decision list is written as an ordered list of pairs: (tq . vq−1 ) ··· (ti . and T is a term whose value is 1 (regardless of the values of the xi ). 0) x2 x3 . ti . Interesting generalizations of decision lists use other Boolean functions in place of the terms. The class of decision lists of size k or less is called kDL. A CNF expression is called a kclause CNF expression if it is a conjunction of k clauses. there are also 2O(n ) functions in kCNF. x2 = 0. If f is written in DNF. An example decision list is: f= (x1 x2 . the ti are terms in (x1 . 2.) The decision list is of size k. v1 ) where the vi are either 0 or 1. 0) f has value 0 for x1 = 0. It has been shown that the class kDL is a strict superset of the union of k kDNF and kCNF. v1 can be regarded as a default value of the decision list. if the size of the largest term in it is k.22 CHAPTER 2. 1987]. This function is in 3DL. For example we might use linearly separable functions in place of the ti (see below and [Marchand & Golea. v2 ) (T. it is in the class kCNF if the size of its largest clause is k.
. If k = 1. . if k = n. wn ) is an ndimensional vector of weight values. θ) is 1 if σ ≥ θ and 0 otherwise. 1. while two are not. illustrated in Fig. if k = (n + 1)/2 for n odd or k = 1 + n/2 for n even. a voting function is the same as an nsized term. we have the majority function. . . W = (w1 . are realvalued numbers called weights. and X · W is the dot (or inner) product of the two vectors. . 2. a voting function is the same as an nsized clause. xn ) is an ndimensional vector of input variables.2. it is easy to see that two of the functions in Fig. 2. Input vectors for which f has value 1 lie in a halfspace on one side of (and on) a hyperplane whose orientation is normal to W and whose position (with respect to the origin) is determined by θ. The parity functions. i = 1. but the following table gives the numbers for n up to 6. We saw an example of such a separating plane in Fig. .2.2. . θ) where wi . The or and and functions of two dimensions are also symmetric. Thus. . With this idea in mind.) The kvoting functions are all members of the class of linearly separable functions in which the weights all have unit value and the threshold depends on k.5 Symmetric and Voting Functions A Boolean function is called symmetric if it is invariant under permutations of the input variables.4 are linearly separable functions as evidenced by the separating planes shown. (The exclusive or function. CLASSES OF BOOLEAN FUNCTIONS 23 2.3 and 2. (Note that the concept of linearly separable functions can be extended to nonBoolean inputs. θ) where X = (x1 .1. A convenient way to write linearly separable functions uses vector notation: f = thresh(X · W. is an oddparity function of two dimensions.6. terms and clauses are special cases of linearly separable functions. 2. . which have value 1 depending on whether or not the number of input variables with value 1 is even or odd is a symmetric function. any function that is dependent only on the number of input variables whose values are 1 is a symmetric function. Also note that the terms in Figs.2. .) An important subclass of the symmetric functions is the class of voting functions (also called mofn functions). . θ is a realvalued number called the threshold. There is no closedform expression for the number of linearly separable functions of n dimensions. . and thresh(σ.6 Linearly Separable Functions The linearly separable functions are those that can be expressed as follows: n f = thresh( i=1 wi xi .1 are linearly separable. . n. 2. For example. A kvoting function has value 1 if and only if k or more of its n inputs has value 1.
Winder.8 × 1019 Linearly Separable Functions 4 14 104 1.572 15. (See also [Winder. BOOLEAN FUNCTIONS n 1 2 3 4 5 6 Boolean Functions 4 16 256 65.882 94. ksizeterms kDNF kDL terms DNF (All) lin sep Figure 2.6: Classes of Boolean Functions The sizes of the various classes are given in the following table (adapted from [Dietterich.3 Summary The diagram in Fig.) 2.6 shows some of the set inclusions of the classes of Boolean functions that we have considered. 2.134 2 [Muroga. 1962].24 CHAPTER 2. page 262]): . 1990. 1971] has shown that (for n > 1) there are no more than 2n linearly separable functions of n dimensions. We will be confronting these classes again in later chapters.028. 1961.3 × 109 ≈ 1.536 ≈ 4.
. BIBLIOGRAPHICAL AND HISTORICAL REMARKS Class terms clauses kterm DNF kclause CNF kDNF kCNF kDL lin sep DNF Size of Class 3n 3n O(kn) 2 2O(kn) k 2O(n ) k 2O(n ) k 2O[n k log(n)] 2 2O(n ) n 22 25 2.4.4 Bibliographical and Historical Remarks To be added.2.
26 CHAPTER 2. BOOLEAN FUNCTIONS .
Ξ. Hv . that is consistent with these values. we say a mistake is made. if this majority is not equal to the value of the pattern presented. Consider the following procedure for classifying an arbitrary input pattern. the version space is that subset of hypotheses. 3.1 Version Spaces and Mistake Bounds The ﬁrst learning methods we present are based on the concepts of version spaces and version graphs. where H is the number of hypotheses in the 27 . is consistent with the values of X in Ξ if and only if h(X) = f (X) for all X in Ξ. A hypothesis. h. An incremental training procedure could then be deﬁned which presented each pattern in Ξ to each of these functions and then eliminated those functions whose values for that pattern did not agree with its given value. We could imagine (conceptually only!) that we have devices for implementing every function in H. We say that the hypotheses in H that are not consistent with the values in the training set are ruled out by the training set. whenever a mistake is made. Given an initial hypothesis set H (a subset of all Boolean functions) and the values of f (X) for each X in a training set. These ideas are most clearly explained for the case of Boolean function learning.1. During the learning procedure. we rule out at least half of the functions remaining in the version space. X: the pattern is put in the same class (0 or 1) as are the majority of the outputs of the functions in the version space. How many mistakes can such a procedure make? Obviously. this subset is the version space for the patterns already presented. This idea is illustrated in Fig. At any stage of the process we would then have left some subset of functions that are consistent with the patterns presented so far. we can make no more than log2 (H) mistakes. Thus.Chapter 3 Using Version Spaces for Learning 3. and we revise the version space accordingly—eliminating all those (majority of the) functions voting incorrectly.
H.585n mistakes before learning f . As a special case. of all Boolean Functions h1 h2 1 or 0 Rule out hypotheses not consistent with training patterns X hi hj Hypotheses not ruled out constitute the version space hK K = H Figure 3. it may be that there are enough training patterns to reduce the version space to a set of functions such that most of them assign the same values to most of the patterns we will see henceforth. It shows that there must exist a learning procedure that makes no more mistakes than this upper bound.1: Implementing the Version Space original hypothesis set. USING VERSION SPACES FOR LEARNING A Subset. we would make no more than 1. 1988]) is an example of a mistake bound—an important concept in machine learning theory. (Note. We next discuss a computationally more feasible method for representing the version space. . though. Even if we do not have suﬃcient training patterns to reduce the version space to a single function. if our bias was to limit H to terms. Later. and otherwise we would make no more than that number of mistakes before being able to decide that f is not a term.585n mistakes before exhausting the version space. H.28 CHAPTER 3. that the number of training patterns seen before this maximum number of mistakes is made might be much greater. we’ll derive other mistake bounds. We could select one of the remaining functions at random and be reasonably assured that it will generalize satisfactorily. we would make no more than log2 (3n ) = n log2 (3) = 1. This result means that if f were a term.) This theoretical (and very impractical!) result (due to [Littlestone.
corresponds to the node at the top of the graph. (and f2 is more speciﬁc than f1 ). In Fig. We can form a graph with the hypotheses. hj . A node in the graph. if f1 has value 1 for all of the arguments for which f2 has value 1. VERSION GRAPHS 29 3. A Boolean function. Just below “1” is a row of nodes corresponding to all terms having just one literal. is more general than a function.2. 3. f1 . the function “0” is at the bottom of the graph. We call such a graph a version graph. has an arc directed to node.) Similarly.2. f2 . in the version space as nodes. and just below them is a row of nodes corresponding to terms having two literals. For example. x3 Version Graph for Terms (none yet ruled out) 1 x2 x1 x1 x1 x2 x2 x3 (k = 1) x3 (k = 2) x1 x3 x1x2 x3 x1x2 (k = 3) 0 (for simplicity. x3 is more general than x2 x3 but is not more general than x3 + x2 .2: A Version Graph for Terms That function.” which has value 1 for all inputs. denoted here by “1. we show an example of a version graph over a 3dimensional input space for hypotheses restricted to terms (with none of them yet ruled out). hi . only some arcs in the graph are shown) Figure 3.3. and f1 = f2 .2 Version Graphs Boolean functions can be ordered by generality. (It is more general than any other term. and . {hi }. if and only if hj is more general than hi .
0. 3. there are always a set of hypotheses that are maximally general and a set of hypotheses that are maximally speciﬁc. 1 has value 0 x2 1 New Version Graph x1 ruled out nodes x1 x1 x2 x2 x1x2 x3 x3 x2 x3 x1x2 x1x3 x1x2 x3 x1x3 x1x2x3 0 (only some arcs in the graph are shown) Figure 3. We also show there the threedimensional cube representation in which the vertex (1. 1) with value 0.3: The Version Graph Upon Seeing (1. 0.30 CHAPTER 3.3. The gbs and sbs are shown. 0) has value 1.2 are inconsistent with this training pattern. 0. 1) In a version graph. respectively. These are called the general boundary set (gbs) and the speciﬁc boundary set (sbs). Some of the functions in the version graph of Fig. 3. There are 33 = 27 functions altogether (the function “0. These ruled out nodes are no longer in the version graph and are shown shaded in Fig. 0. is technically not a term). 1) has value 0.1) has value 0 and (1.4. To make our portrayal of the graph less cluttered only some of the arcs are shown.0. 0. In Fig. Suppose we ﬁrst consider the training pattern (1. Ξ. We use this same example to show how the version graph changes as we consider a set of labeled samples in a training set.” included in the graph. we have the version graph as it exists after learning that (1. each node in the actual graph has an arc directed to all of the nodes above it that are more general. . USING VERSION SPACES FOR LEARNING so on. 3. x3 1.
4: The Version Graph Upon Seeing (1. 1) and (1. more general than sbs x2 general boundary set (gbs) x3 x1 x1x3 x2x3 x1x2 x1 x2 x2 x3 x1x2 x3 specific boundary set (sbs) 0 Figure 3. 0) itself—corresponding to the function x1 x2 x3 . 0. A maximally general one corresponds to a subface of maximal dimension that contains all the members of the training set labelled by a 1 and no members labelled by a 0. 1) is just the vertex (1. A maximally speciﬁc one corresponds to a subface of minimal dimension that contains all the members of the training set labelled by a 1 and no members labelled by a 0. 0.3. 0 has value 1 x1 1 more specific than gbs. we see that the subface of minimal dimension that contains (1. 0. This determination is possible because of the fact that any member of the version space (that is not a member of one of the boundary sets) is more speciﬁc than some member of the general boundary set and is more general than some member of the speciﬁc boundary set. VERSION GRAPHS 31 x3 1. The subface . which would be impractical. 0. 1 has value 0 1.2.4. it is a simple matter to determine maximally general and maximally speciﬁc functions (assuming that there is some term that is in the version space). 0) Boundary sets are important because they provide an alternative to representing the entire version space explicitly. 0. 0. Looking at Fig. 0. 3. 0) but does not contain (1. Given only the boundary sets. it is possible to determine whether or not any hypothesis (in the prescribed class of Boolean functions we are using) is a member or not of the version space. If we limit our Boolean functions that can be in the version space to terms.
For a negative instance the algorithm specializes elements of the [gbs] so that they no longer cover the new instance yet remain consistent with past data. Quoting from [Hirsh. negative instance. 3. page 6]: “The candidateelimination algorithm manipulates the boundaryset representation of a version space to create boundary sets that represent a new version space consistent with all the previous instances plus the new one. We shall see instances of both styles of learning in this book. 1) is the bottom face of the cube—corresponding to the function x3 . 3. and removes from the [sbs] those elements that mistakenly cover the new. and removes those elements of the [gbs] that do not cover the new instance. Version spaces for terms always have singular speciﬁc boundary sets.2 through 3. 3. As seen in Fig. In Figs. is an incremental method for computing the boundary sets. 1994. For a positive exmple the algorithm generalizes the elements of the [sbs] as little as possible so that they cover the new instance yet remain consistent with past data. USING VERSION SPACES FOR LEARNING of maximal dimension that contains (1. [To be written. the gbs of a version space for terms need not be singular. 0.32 CHAPTER 3.4 The Candidate Elimination Method The candidate elimination method.3. Or.3 Learning as Search of a Version Space Compare this view of topdown versus bottomup with the divideandconquer and the covering (or AQ) methods of decisiontree induction. Relate to term learning algorithm presented in Chapter Two. See Pat Langley’s example using “pseudocells” of how to generate and eliminate hypotheses. however. 0) but does not contain (1.4 the sbs is always singular.] Selecting a hypothesis from the version space can be thought of as a search problem. 0.” The method uses the [Genesereth & Nilsson. One can start with a very general function and specialize it through various specialization operators until one ﬁnds a function that is consistent (or adequately so) with a set of training patterns. one can start with a very special function and generalize it—resulting in bottomup methods. Also discuss bestﬁrst search methods. Such procedures are usually called topdown methods. 1987]): following deﬁnitions (adapted from • a hypothesis is called suﬃcient if and only if it has value 1 for all training samples labeled by a 1. . 3. • a hypothesis is called necessary if and only if it has value 0 for all training samples labeled by a 0.
1. in it by all of its least generalizations. (That is. it might be that hs = h. Again. 1) (0. The hypothesis hg is a least generalization of h if and only if: a) h is more speciﬁc than hg . least generalizations of two diﬀerent functions in the speciﬁc boundary set may be identical. hi . we exclude any elements that have value 1 for the new vector. If the new vector is labelled with a 0: The new speciﬁc boundary set is obtained from the previous one by excluding any elements in it that are not necessary. and least specializations of two diﬀerent functions in the general boundary set may be identical. we exclude any elements that have value 0 for the new vector. 0. 1) label 0 1 0 0 . suppose we present the vectors in the following order: vector (1. b. THE CANDIDATE ELIMINATION METHOD 33 Here is how to think about these deﬁnitions: A hypothesis implements a suﬃcient condition that a training sample has value 1 if the hypothesis has value 1 for all of the positive instances. 0. and d) hg is more speciﬁc than some member of the new general boundary set. (That is. hi . As an example. in it by all of its least specializations.3. c) no function (including h) that is more speciﬁc than hg is suﬃcient.) The new speciﬁc boundary set is obtained from the previous one by replacing each element. 1) (1. 0) (1.4. We start (before receiving any members of the training set) with the function “0” as the singleton element of the speciﬁc boundary set and with the function “1” as the singleton element of the general boundary set. b) hs is necessary. the boundary sets are changed as follows: a. It might be that hg = h. b) hg is suﬃcient.) The new general boundary set is obtained from the previous one by replacing each element. A hypothesis is consistent with the training set (and thus is in the version space) if and only if it is both suﬃcient and necessary. and d) hs is more general than some member of the new speciﬁc boundary set. The hypothesis hs is a least specialization of h if and only if: a) h is more general than hs . Also. c) no function (including h) that is more general than hs is necessary. 0. Upon receiving a new labeled input vector. a hypothesis implements a necessary condition that a training sample has value 1 if the hypothesis has value 0 for all of the negative instances. If the new vector is labelled with a 1: The new general boundary set is obtained from the previous one by excluding any elements in it that are not suﬃcient.
and the speciﬁc boundary set is changed to {x1 x2 x3 }. This single function is a least generalization of “0” (it is suﬃcient. 0. “0” is more speciﬁc than it. they are more general than “0”. 3. When we see (1. x2 . and we change the general boundary set to {x1 . 1). 1994] to allow hypotheses that are not necessarily consistent with the training set. “1”. 1). Although these ideas are not used in practical machine learning procedures. labeled with a 0. 0. and x3 .5 Bibliographical and Historical Remarks More to be added. and speciﬁc boundary set. they do provide insight into the nature of hypothesis selection. 1).34 CHAPTER 3. “0. 1982]. the speciﬁc boundary set stays at “0” (it is necessary). (1. labeled with a 0. x1 . 0. . x2 . 1. The concept of version spaces and their role in learning was ﬁrst investigated by Tom Mitchell [Mitchell. We do not change the general boundary set either because x3 is still necessary. we do not change the speciﬁc boundary set because its function is still necessary. are least specializations of “1” (they are necessary. we do not change the speciﬁc boundary set because its function is still necessary. the general boundary set changes to {x3 } (because x1 and x2 are not suﬃcient). after seeing (1. We do not change the general boundary set either because x3 is still necessary. x3 }. 0). USING VERSION SPACES FOR LEARNING We start with general boundary set. “1” is not.” After seeing the ﬁrst sample. when we see (0. Then. and there are no functions that are more general than they and also necessary). and it is more speciﬁc than some member of the general boundary set. labeled with a 0. Maybe I’ll put in an example of a version graph for nonBoolean functions. version spaces have been generalized by [Hirsh. Each of the functions. no function (including “0”) that is more speciﬁc than it is suﬃcient. labeled with a 1. Finally. In order to accomodate noisy data.
interconnected through adjustable weights. a perceptron. 1958]. 4. We begin our treatment of neural nets by studying this threshold element and how it can be used in the simplest of all networks. We next have the question of how best to implement the function as a device that gives the outputs prescribed by the function for arbitrary inputs. Suppose we decide to use one of these subsets as a hypothesis set for supervised function learning. This structure we call a threshold logic unit (TLU). play a prominent role in machine learning. (Although the word “perceptron” is often used nowadays to refer to a single TLU.1 4. 1990]. These networks commonly use the threshold element which we encountered in chapter two in our study of linearly separable Boolean functions.1. In this chapter we describe how networks of nonlinear elements can be used to implement various inputoutput functions and how they can be trained using supervised learning methods. 1962. Widrow & Lehr.Chapter 4 Neural Networks In chapter two we deﬁned several important subsets of Boolean functions. 4.) 35 .1. They are called neural networks because the nonlinear elements have as their inputs a weighted sum of the outputs of other elements—much like networks of biological neurons do. Rosenblatt originally deﬁned it as a class of networks of threshold elements [Rosenblatt. namely ones composed of a single threshold element. an LTU (linear threshold unit). Its output is 1 or 0 depending on whether or not the weighted sum of its inputs is greater than or equal to a threshold value. θ. and a neuron. Networks of nonlinear elements.1 Threshold Logic Units Deﬁnitions and Geometry Linearly separable (threshold) functions are implemented in a straightforward way by summing the weighted inputs and comparing this sum to a threshold value as shown in Fig. It has also been called an Adaline (for adaptive linear element) [Widrow.
wn+1 . but we often specialize to the binary numbers 0 and 1. The (n + 1)st component. θ. we’ll use the Y. otherwise it has output 0. . where the “row” vector Xt is the transpose of X. X•W.) Often. When we want to distinguish among diﬀerent feature vectors. 0) Figure 4. (When we want to emphasize the use of augmented vectors. . (The normal . (If the pattern and weight vectors are thought of as “column” vectors. . and V. The weighted sum that is calculated by the TLU can be simply represented as a vector dot product.) In the Y.V notation. however when the context of the discussion makes it clear about what sort of vectors we are talking about. . Otherwise. is set equal to the negative of the desired threshold value. in that case. this dot product is then sometimes written as Xt W. The hyperplane is the boundary between patterns for which X•W + wn+1 > 0 and patterns for which X•W + wn+1 < 0. The weights of a TLU are represented by an ndimensional weight vector. The components of X can be any realvalued numbers. . the equation of the hyperplane itself is W X•W + wn+1 = 0.2. . + wn ) is the length of the vector W. of the augmented feature vector. the TLU has an output of 1 if Y•V ≥ 0. Y. respectively. Thus. Its components are realvalued numbers (but we sometimes specialize to integers). NEURAL NETWORKS X x1 x 2 x i xn W w1 w2 wi wn wn+1 threshold " = 0 ! threshold weight n+1 f xn+1 = 1 f = thresh( i=1 ! wi xi. xn ). . 4. the output is 0. wn ). whose ﬁrst n components are the same as those of X and W. the (n + 1)st component. Y. always has value 1. . n The TLU has output 1 if i=1 xi wi ≥ θ.36 CHAPTER 4. V. arbitrary thresholds are achieved by using (n + 1)dimensional “augmented” vectors. A TLU divides the input space by a hyperplane as sketched in Fig. such as Xi . The unit vector that is normal to the hyperplane is n = W . where W = 2 2 (w1 + .1: A Threshold Logic Unit (TLU) The ndimensional feature or input vector is denoted by X = (x1 . . of the TLU is ﬁxed at 0. We can give an intuitively useful geometric description of a TLU.V notation. we’ll lapse back into the more familiar X. W = (w1 . . we will attach subscripts. the threshold. of the augmented weight vector.W notation. xn+1 .
When the distance from the hyperplane to the W origin is negative (that is. Thus. training of a TLU can be achieved by adjusting the values of the weights. In this way the hyperplane can be moved so that the TLU implements diﬀerent (linearly separable) functions of the input.W + wn+1 > 0 X wn+1 W on this side X W + wn+1 X.W + wn+1 < 0 on this side W Origin n= W W Unit vector normal to hyperplane Figure 4. to the hyperplane is X•W+wn+1 . and a weight of −1 is used from an input corresponding to a negative literal. changes the orientation of the hyperplane. X.) The distance from the wn+1 hyperplane to the origin is W . is set equal to kp − 1/2.2 Terms Special Cases of Linearly Separable Functions Any term of size k can be implemented by a TLU with a weight from each of those inputs corresponding to variables occurring in the term.1. θ.4. Equations of hyperplane: X W + wn+1 = 0 wn+1 =0 X n+ W X. 4. Such a TLU implements a hyperplane boundary that is . (Literals not mentioned in the term have weights of zero—that is. when wn+1 < 0).) The threshold. then the origin is on the negative side of the hyperplane (that is. and the distance from an arbitrary point. no connection at all—from their inputs.2: TLU Geometry Adjusting the weight vector.1. the side for which X•W + wn+1 < 0). adjusting wn+1 changes the position of the hyperplane (relative to the origin). where kp is the number of positive literals in the term. A weight of +1 is used from an input corresponding to a positive literal. W. THRESHOLD LOGIC UNITS 37 W form of the hyperplane equation is X•n + W = 0.
3. linearly separable functions are also a superset of clauses. This process simply changes the orientation of the hyperplane—ﬂipping it around by 180 degrees and thus changing its “positive side. We use augmented feature and weight vectors in describing them. Inverting a hyperplane is done by multiplying all of the TLU weights—even wn+1 —by −1. it will implement the clause instead. the negation of the clause f = x1 + x2 + x3 is the term f = x1 x2 x3 . of vectors. 4. they are called errorcorrection procedures. a. 4.1. linearly separable functions are a superset of terms.3/2 = 0 Figure 4. Yi . A hyperplane can be used to implement this term.1) x2 (1. Ξ.1.4. For example. . These methods make adjustments to the weight vector only when the TLU being trained makes an error on a training pattern. NEURAL NETWORKS parallel to a subface of dimension (n − k) of the unit hypercube. We present next a family of incremental training procedures with parameter c.1.” Therefore. If we “invert” the hyperplane. We show a threedimensional example in Fig. x3 f = x1x2 (1. We show an example in Fig. We start with a ﬁnite training set.0) x1 Equation of plane is: x1 + x2 . Thus.3 ErrorCorrection Training of a TLU There are several procedures that have been proposed for adjusting the weights of a TLU. 4.38 CHAPTER 4.3: Implementing a Term Clauses The negation of a clause is a term. and their binary labels.
1. which is smaller than it was before the weight adjustment. THRESHOLD LOGIC UNITS 39 f = x1 + x2 + x3 x3 f = x1x2x3 x2 x1 Equation of plane is: x1 + x2 + x3 1/2 = 0 Figure 4. c. (a) If the TLU responds correctly. of vectors from Ξ and their labels such that each member of Ξ occurs inﬁnitely often in Σ. in Σ to the TLU and note its response.4: Implementing a Clause b. Σ. modify the weight vector as follows: V ←− V − ci Yi where ci is a positive real number called the learning rate parameter (whose value is diﬀererent in diﬀerent instances of this family of procedures and may depend on i). the new dot product will be (V + ci Yi )•Yi = V•Yi + ci Yi •Yi . Note that after this adjustment the new dot product will be (V − ci Yi )•Yi = V•Yi − ci Yi •Yi .4. make no change in the weight vector. modify the weight vector as follows: V ←− V + ci Yi In this case. Set the initial weight values of an TLU to arbitrary values. (c) If Yi is supposed to produce an output of 1 and produces an output of 0 instead. Repeat forever: Present the next vector. Yi . (b) If Yi is supposed to produce an output of 0 and produces an output of 1 instead. Compose an inﬁnite training sequence. which is larger than it was before the weight adjustment. Note that all three of these cases can be combined in the following rule: .
40 CHAPTER 4. If λ = 1. Now. Note that if λ = 0. the error will be corrected. Weight points in one of the halfspaces deﬁned by this hyperplane will cause the corresponding pattern to yield a dot product less than 0. the correction is just suﬃcient to make Yi •V = 0. 3. 4. proofs. that produces a correct output for all of the feature vectors in Ξ. For additional background.] Note also that because the weight vector V now includes the wn+1 threshold component. Depending on the value of this constant. positive constant for all i. for any pattern vector. the threshold of the TLU is also changed by these adjustments. It can be proved that if there is some weight vector. ci .4 Weight Space We can give an intuitive idea about how these procedures work by considering what happens to the augmented weight vector in “weight space” as corrections are made. the parameter ci is set to λ Yii•Yi . See [Maass & Tur´n.1. 2. corresponding to patterns Y1 . then corresponds to a point in (n + 1)dimensional weight space. V. and examples of errorcorrection procedures. Yi •V. .5. then after a ﬁnite number of feature vector presentations. is the same ﬁxed. 1. the ﬁxedincrement procedure will ﬁnd such a weight vector and thus make no more weight changes. see [Nilsson. and fi is the actual response (1 or 0) for Yi . consider the locus of all points in weight space corresponding to weight vectors yielding Yi •V = 0. We identify two versions of this procedure: 1) In the ﬁxedincrement procedure. where V is the weight vector before it is changed. If λ > 1. against a threshold of 0. This locus is a hyperplane passing through the origin of the (n + 1)dimensional space. 4. and weight points in the other halfspace will cause the corresponding pattern to yield a dot product greater than 0. The same result holds for the fractionalcorrection procedure if 1 < λ ≤ 2. A particular weight vector. Each pattern vector will have such a hyperplane corresponding to it. We show a schematic representation of such a weight space in Fig. Yi . NEURAL NETWORKS V ←− V + ci (di − fi )Yi where di is the desired response (1 or 0) for Yi . the learning rate parameter. 4 . 1990]. 1994] for a a hyperplaneﬁnding procedure that makes no more than O(n2 log n) mistakes. no correction takes place at all. We use augmented vectors in our discussion here so that the threshold function compares the dot product. V. the weight adjustment may or may not correct the response to an erroneously classiﬁed feature vector. Y •V 2) In the fractionalcorrection procedure. There are four pattern hyperplanes.
and we indicate by an arrow the halfspace for each in which weight vectors give dot products greater than 0. If we do that for our example above. indicated in the ﬁgure is one such set of weight values. 4. V. Y3 . The weight point.4.6. 3 2 1 V 4 Figure 4. Suppose we wanted weight values that would give positive responses for patterns Y1 . To do so involves reversing the “polarity” of those hyperplanes corresponding to patterns for which a negative response is desired.6: Solution Region in Weight Space .1.5: Weight Space The question of whether or not there exists a weight vector that gives desired responses for a given set of patterns can be given a geometric interpretation. Y4 . respectively. we get the weight space diagram shown in Fig. 3 3 2 2 1 1 4 V 4 0 3 2 1 Figure 4. THRESHOLD LOGIC UNITS 41 Y2 . and a negative response for pattern Y2 . Y3 . and Y4 .
for later purposes. The . 3 V1 2 V5 1 V2 V3 V4 V6 4 V Figure 4. Y3 . The solution region will be a “hyperwedge” region whose vertex is at the origin of weight space and whose crosssection increases with increasing distance from the origin. called the solution region. we see that it gives an incorrect response for pattern Y1 .6. but eventually we work our way out far enough in the solution region that corrections (for a ﬁxed increment size) take us within it. and so on. the responses are only incorrect for planes bounding the solution region. as shown in Fig.7. Ultimately.1. Starting at V1 . 4. Some of the subsequent corrections may overshoot the solution region. the pattern labels are assumed to be either +1 or −1 (instead of 1 or 0). Suppose in our example that we present the patterns in the sequence Y1 . 4. the number of errors made by weight vectors in each of the regions. (That is what adding Y1 to V1 does. so we move V1 to V2 in a direction normal to plane 1. then the halfspaces deﬁned by the correct responses for these patterns will have a nonempty intersection. Y4 . (The boxed numbers show. For this purpose. NEURAL NETWORKS If a weight vector exists that correctly classiﬁes a set of patterns.5 The WidrowHoﬀ Procedure The WidrowHoﬀ procedure (also called the LMS or the delta procedure) attempts to ﬁnd weights that minimize a squarederror function between the pattern labels and the dot product computed by a TLU.) Y2 gives an incorrect response for pattern Y2 . This region is shown shaded in Fig.) The ﬁxedincrement errorcorrection procedure changes a weight vector by moving it normal to any pattern hyperplane for which that weight vector gives an incorrect response. and start the process with a weight point V1 .42 CHAPTER 4. The proofs for convergence of the ﬁxedincrement rule make this intuitive argument precise.7: Moving Into the Solution Region 4. Y2 .
does not equal the speciﬁed desired target n+1 n+1 . Yi is the ith augmented pattern vector. The jth component of the gradient of the singlepattern error is: ∂εi = −2(di − xij wj )xij ∂wj j=1 An adjustment in the direction of the negative gradient would then change each weight as follows: wj ←− wj + ci (di − fi )xij where fi = j=1 xij wj . One way to ﬁnd such a set of weights is to start with an arbitrary weight vector and move it along the negative gradient of ε as a function of the weights. as before. Each component of the gradient is the partial derivative of ε with respect to one of the weights. but the approximation is usually quite eﬀective. of Ξ at a time. The total squared error (over all patterns in a training set. notation) is thus adjusted according to the following rule: V ←− V + ci (di − fi )Yi where. Xi . The WidrowHoﬀ procedure makes adjustments to the weight vector whenever the dot product itself. we know that it has a global minimum. Since ε is quadratic in the wj . THRESHOLD LOGIC UNITS squared error for a pattern. and then try another member of Ξ. or V. containing m patterns) is then: m n+1 ε= i=1 (di − j=1 xij wj )2 We want to choose the weights wj to minimize this squared error. We will be describing the incremental version here. make the appropriate adjustment to the weights. it is preferable to use an incremental procedure in which we try the TLU on just one element. and thus this steepest descent procedure is guaranteed to ﬁnd the minimum. the results of the incremental version can only approximate those of the batch one. εi . and ci governs the size of the adjustment. Xi . Yi •V.1. One problem with taking the partial derivative of ε is that ε depends on all the input vectors in Ξ. The entire weight vector (in augmented. Often. Of course. Ξ.4. with label di (for desired output) is: n+1 43 εi = (di − j=1 xij wj )2 where xij is the jth component of Xi . compute the gradient of the singlepattern squared error.
called a linear machine. Errorcorrection proceeds as usual with the ordinary set of weights. A meansquarederror criterion often gives unsatisfactory results. which (although it will not converge to zero error on nonlinearly separable problems) will give us a weight vector that minimizes the meansquarederror. Duda [Duda. . Gallant [Gallant. the weights in the pocket are used as a set that should give a small number of errors on the training set. 1991. 4. After stopping. 160]: . 1966] has suggested keeping track of the average value of the weight vector during error correction and using this average to give a separating hyperplane that performs reasonably well on nonlinearlyseparable problems. see [Duda & Hart. p. Here. The latter is claimed to outperform the pocket algorithm. 1973. NEURAL NETWORKS Examples of training curves for TLU’s.) 4. consists simply in storing (or “putting in your pocket”) the set of weights which has had the longest unmodiﬁed run of successes so far. to use more . Finding weight values that give the desired dot products corresponds to solving a set of linear equalities. performance on test set. The only diﬀerence is that fi is the thresholded response of the TLU in the errorcorrection case while it is the dot product itself for the WidrowHoﬀ procedure. . value. might decrease with time toward 0 to achieve asymptotic convergence. . ci . . pp. & Palmer.2 Linear Machines The natural generalization of a (twocategory) TLU to an Rcategory classiﬁer is the structure.8. The WidrowHoﬀ formula for changing the weight vector has the same form as the standard ﬁxedincrement errorcorrection formula. error correction with a continuous decrease toward zero of the value of the learning rate constant. and the WidrowHoﬀ procedure can be interpreted as a descent procedure that attempts to minimize the meansquarederror between the actual and desired values of the dot product. First. Several methods have been proposed to deal with this case. cumulative number of corrections. the errorcorrection procedures will not do well on nonlinearlyseparable training sets because they will continue to attempt to correct inevitable errors. 1986] proposed what he called the “pocket algorithm. (For more on WidrowHoﬀ and other related procedures.1. however. c. performance on training set. it may still be desired to ﬁnd a “best” separating hyperplane. . will result in ever decreasing changes to the hyperplane. 151ﬀ]. 1995] and by [Marchand & Golea. . and the hyperplane will never settle into an acceptable place. The algorithm is stopped after some chosen time t . 1993]. The learningrate factor. Also see methods proposed by [John. As an alternative.6 Training a TLU on NonLinearlySeparable Training Sets When the training set is not linearly separable (perhaps because of noise or perhaps inherently).” As described in [Hertz. di (which is either 1 or −1).44 CHAPTER 4. we might use the WidrowHoﬀ procedure. Krogh. the pocket algorithm . 4. because it prefers many small errors to a few large ones. shown in Fig. Typically.
{1.X . a. . WR. Such a structure is also sometimes called a “competitive” net or a “winnertakeall” net. . corresponding to which dot product is largest. . R1 R2 R3 R4 R5 In this region: X. If the machine classiﬁes a pattern correctly. The output of the linear machine is one of the numbers. Note that when R = 2. .W4 X.. the linear machine reduces to a TLU with weight vector W = (W1 − W2 ). no change is made to any of .8: A Linear Machine The diagram in Fig. Assemble the patterns in the training set into a sequence as before. LINEAR MACHINES 45 familiar notation.. the Ws and X are meant to be augmented vectors (with an (n+1)st component).9 shows the character of the regions in a 2dimensional space created by a linear machine for R = 5. In n dimensions.9: Regions For a Linear Machine To train a linear machine.4.Wi for i 4 Figure 4. 4. R}. every pair of regions is either separated by a section of a hyperplane or is nonadjacent.2. W1 X WR W1.X ARGMAX Figure 4. there is a straightforward generalization of the 2category errorcorrection rule.
Note that when R = 2. NEURAL NETWORKS b. in category v (u = v). then: Wu ←− Wu + ci Xi and Wv ←− Wv − ci Xi and all other weight vectors are not changed. In the ﬁgure. this procedure reduces to the ordinary TLU errorcorrection procedure. No single line through the 2dimensional square can separate the vertices (1. This correction increases the value of the uth dot product and decreases the value of the vth dot product. if there exists weight vectors that make correct separations of the training set.1) and (0.46 the weight vectors. feedforward . 1973.3. for constant ci . 174177]. Just as in the 2category ﬁxed increment procedure. Feedforward networks have no cycles. for example. The function implemented by a network of TLUs depends on its topology as well as on the weights of the individual TLUs.0) from the vertices (1. A proof that this procedure terminates is given in [Nilsson. f = x1 x2 + x1 x2 . 4. in a feedforward network no TLU’s input depends (through zero or more intermediate TLUs) on that TLU’s output. 1990. pp. If the TLUs of a feedforward network are arranged in layers. even parity function. 8890] and in [Duda & Hart. 4. this procedure is guaranteed to terminate. Xi .10 does implement this function. pp.1 Networks of TLUs Motivation and Examples Layered Networks To classify correctly all of the patterns in nonlinearlyseparable training sets requires separating surfaces more complex than hyperplanes. we show the weight values along input lines to each TLU and the threshold value inside the circle representing the TLU. then we say that the network is a layered. (Networks that are not feedforward are called recurrent networks). with the elements of layer j receiving inputs only from TLUs in layer j − 1. If the machine mistakenly classiﬁes a category u pattern. the network of three TLUs shown in Fig.3 4. CHAPTER 4. One way to achieve more complex surfaces is with networks of TLUs. But. the 2dimensional.0) and (0.1)—the function is not linearly separable and thus cannot be implemented by a single TLU. Consider.
(We leave it to the reader to calculate appropriate values of weights and . 4.12.10 is a layered. The network shown in Fig.11: A Layered.5 f 0.10: A Network for the Even Parity Function network. consider the function f = x1 x2 + x2 x3 + x1 x3 . Since any Boolean function has a DNF form. All of the TLUs except the “output” units are called hidden units (they are “hidden” from the output). a feedforward. 4. any Boolean function can be implemented by some twolayer network of TLUs. A kterm DNF function can be implemented by a twolayer network with k units in the hidden layer—to implement the k terms—and one output unit to implement the disjunction of these terms. The form of the network that implements this function is shown in Fig. they would call this network a threelayer network.11. layered network has the structure shown in Fig. X output units hidden units Figure 4. NETWORKS OF TLUS 47 x1 1 x2 1 1 1 1. (Some people count the layers of TLUs and include the inputs as a layer also.4. As an example.3.5 1 0. feedforward network having two layers (of weights). 4.) In general.5 1 Figure 4. Feedforward Network Implementing DNF Functions by TwoLayer Networks We have already deﬁned kterm DNF functions—they are DNF functions having k terms.
NPhardness of optimal versions. NEURAL NETWORKS thresholds. so the weights in the ﬁnal layer are ﬁxed. 4.12: A TwoLayer Network f = x1x2 + x2x3 + x1x3 x3 x2 x1 Figure 4. Discuss halfspace intersections.48 CHAPTER 4. Later. we ﬁrst note that the output unit implements a disjunction. −1. The network of Fig.13: Three Planes Implemented by the Hidden Units To train a twolayer network that implements a kterm DNF function. 4. singlesideerrorhypeplane methods. . halfspace unions.13. The weights in the ﬁrst layer (except for the “threshold weights”) can all have values of 1. we will present a training procedure for this ﬁrst layer of weights. TLUs x disjunct disjunction of terms conjunctions conjuncts of literals (terms) A Feedforward. 4.13.12 can be designed so that each hidden unit implements one of the planar boundaries shown in Fig. relation to “AQ” methods.) The 3cube representation of the function is shown in Fig. or 0. 2layer Network Figure 4.
(We assume augmented vectors here even though we are using X. • If the Madaline incorrectly classiﬁes a pattern. Xi . feedforward network is the twolayer one which has an odd number of hidden units. no corrections are made to any of the hidden units’ weight vectors. then determine the minimum number of hidden units whose responses need to be changed (from 0 to 1 or from 1 to 0—depending on the type of error) in order that the Madaline would correctly classify Xi . Ridgway [Ridgway. If the ﬁrst layer does not partition the feature space in this way. We leave it to the reader to think about how this training procedure could be modiﬁed if the output TLU implemented an or function (or an and function). Xi .) That is. the procedure works eﬀectively in most experiments with it. W notation. this procedure will fail to ﬁnd them. Suppose that minimum number is ki . and we change those that are easiest to change. the response of the vote taking unit is deﬁned to be the response of the majority of the hidden units. There are example problems in which even though a set of weight values exists for a given Madaline structure such that it could classify all members of a training set correctly. NETWORKS OF TLUS Important Comment About Layered Networks 49 Adding additional layers cannot compensate for an inadequate ﬁrst layer of TLUs.4.2 Madalines TwoCategory Networks An interesting example of a layered. Nevertheless. Add diagrams showing the nonlinear transformation performed by a layered network. . 1962] proposed the following errorcorrection rule for adjusting the weights of the hidden units of a Madaline: • If the Madaline correctly classiﬁes a pattern.3.3. although other output logics are possible. Of those hidden units voting incorrectly. we perform errorcorrection on just enough hidden units to correct the vote to a majority voting correctly. 4. the ﬁnal outputs will not be consistent with the labeled training set. Typically. so that no two such vectors yield the same set of outputs of the ﬁrstlayer units). then regardless of what subsequent layers do. and a “votetaking” TLU as the output unit. Such a network was called a “Madaline” (for many adalines by Widrow. The ﬁrst layer of TLUs partitions the feature space so that no two differently labeled vectors are in the same region (that is. change the weight vectors of those ki of them whose dot products are closest to 0 by using the error correction rule: W ←− W + ci (di − fi )Xi where di is the desired response of the hidden unit (0 or 1) and fi is the actual response (0 or 1).
.14: A Piecewise Linear Machine .14. 1 MAX W1 N1 X WR 1 . their responses correspond to vertices of a kdimensional hypercube.. namely the vertex consisting of k 1’s and the vertex consisting of k 0’s. We could design an Rcategory Madaline by identifying R vertices in hiddenunit space and then classifying a pattern according to which of these vertices the hiddenunit response is closest to. For similar.3.. It used the fact that the 2p socalled maximallength shiftregister sequences [Peterson. Similarly. NEURAL NETWORKS RCategory Madalines and ErrorCorrecting Output Codes If there are k hidden units (k > 1) in a twolayer network.. more recent work see [Dietterich & Bakiri. 1962]. 147ﬀ] in a (2p − 1)dimensional Boolean space are mutually equidistant (for any integer p). et al. we need a more powerful classiﬁer. R MAX WR NR Figure 4.. W1 1 . The Madaline’s response is 1 if the point in “hiddenunitspace” is closer to the all 1’s vertex than it is to the all 0’s vertex.3 Piecewise Linear Machines A twocategory training set is linearly separable if there exists a threshold function that correctly classiﬁes all members of the training set. pp. 1991]. 1961. we can say that an Rcategory training set is linearly separable if there exists a linear machine that correctly classiﬁes all members of the training set. A machine using that idea was implemented in the early 1960s at SRI [Brain. 4. When an Rcategory problem is not linearly separable. A candidate is a structure called a piecewise linear (PWL) machine illustrated in Fig... ARG MAX . 4. The ordinary twocategory Madaline identiﬁes two special points in this space.50 CHAPTER 4.
” 4. Such a network is called a cascade network. although [Nilsson. NETWORKS OF TLUS 51 The PWL machine groups its weighted summing units into R banks corresponding to the R categories.) We show a 3dimensional sketch for a network of two TLUs in Fig.15: A Cascade Network . We can use an errorcorrection training algorithm similar to that used for a linear machine. L1 x L2 output L3 Figure 4.) Unfortunately.15 in which the TLUs are labeled by the linearly separable functions (of their inputs) that they implement. where k is the number of TLUs from which it receives inputs. An example is shown in Fig. there are example training sets that are separable by a given PWL machine structure but for which this errorcorrection training method fails to ﬁnd a solution. that’s 2k diﬀerent combinations—resulting in 2k diﬀerent positions for the parallel hyperplanes.16. An input vector X is assigned to that category corresponding to the bank with the largest weighted sum.3.3. (Each of the k preceding TLUs can have an output of 1 or 0.4 Cascade Networks Another interesting class of feedforward networks is that in which all of the TLUs are ordered and each TLU receives inputs from all of the pattern components and from all TLUs lower in the ordering. we use augmented vectors here. If a pattern is classiﬁed incorrectly. page 89] observed that “it is probably not a very eﬀective method for training PWL machines having more than three [weight vectors] in each bank. 1965. 4. 1966]. we subtract (a constant times) the pattern vector from the weight vector producing the largest dot product (it was incorrectly the largest) and add (a constant times) the pattern vector to that weight vector in the correct bank of weight vectors whose dot product is locally largest in that bank. The reader might consider how the ndimensional parity function might be implemented by a cascade network having log2 n TLUs. Each TLU in the network implements a set of 2k parallel hyperplanes. (Again. The method does appear to work well in some situations [Duda & Fossum. 4.4.
it is possible to have several TLUs in the output layer—each implementing a diﬀerent function. of course. The input feature vector is denoted by X(0) .16: Planes Implemented by a Cascade Network with Two TLUs Cascade networks might be trained by ﬁrst training L1 to do as good a job as possible at separating all the training patterns (perhaps by using the pocket algorithm. then training L2 (including the weight from L1 to L2 ) also to do as good a job as possible at separating all the training patterns.17 to introduce some notation. Consider.4. This network has only one output unit. If such a network makes an error on a pattern. we might have used V notation instead to include . and the ﬁnal output (of the kth layer TLU) is f . The jth layer of TLUs (1 ≤ j < k) will have as their outputs the vector X(j) . 1990]. 4.52 CHAPTER 4.4 4. we use Fig. Each of the layers of TLUs will have outputs that we take to be the components of vectors. just as the input features are components of an input vector. and so on until the resulting network classiﬁes the patterns in the training set satisfactorily. the WidrowHoﬀ method of gradient descent has been generalized to deal with multilayer networks. one looks for weightadjusting procedures that move the network in the correct direction (relative to the error) by making minimal changes. the layered. for example.1 Training Feedforward Networks by Backpropagation Notation The general problem of training a network of TLUs is diﬃcult. Intuitively. but. (We will assume that the “threshold weight” is the last component of the associated weight vector. In this spirit. the ith TLU in the jth layer has a weight vector denoted by (j) Wi . there are usually several diﬀerent ways in which the error can be corrected. feedforward network of Fig. 4.11. 4. for example). It is diﬃcult to assign “blame” for the error to any particular TLU in the network. Each TLU in each layer has a weight vector (connecting it to its inputs) and a threshold. NEURAL NETWORKS L2 L2 L1 Figure 4. In explaining this generalization. Also mention the “cascadecorrelation” method of [Fahlman & Lebiere.
4.4. TRAINING FEEDFORWARD NETWORKS BY BACKPROPAGATION53 this threshold component, but we have chosen here to use the familiar X,W notation, assuming that these vectors are “augmented” as appropriate.) We denote the weighted sum input to the ith threshold unit in the jth layer by (j) (j) (j) si . (That is, si = X(j−1) •Wi .) The number of TLUs in the jth layer is (j) (j) given by mj . The vector Wi has components wl,i for l = 1, . . . , m(j−1) + 1.
First Layer jth Layer (k1)th Layer kth Layer
X(0)
X(1)
X(j)
X(k1)
... ...
...
Wi(1) si(1)
wli(j)
...
Wi(j) s (j)
...
Wi(k1) s (k1)
wl(k)
W(k) f s(k)
...
. .i .
. .i .
m1 TLUs
mj TLUs
m(k1) TLUs
Figure 4.17: A klayer Network
4.4.2
The Backpropagation Method
A gradient descent method, similar to that used in the Widrow Hoﬀ method, has been proposed by various authors for training a multilayer, feedforward network. As before, we deﬁne an error function on the ﬁnal output of the network and we adjust each weight in the network so as to minimize the error. If we have a desired response, di , for the ith input vector, Xi , in the training set, Ξ, we can compute the squared error over the entire training set to be: ε=
Xi
(di − fi )2
Ξ
where fi is the actual response of the network for input Xi . To do gradient descent on this squared error, we adjust each weight in the network by an amount proportional to the negative of the partial derivative of ε with respect to that weight. Again, we use a singlepattern error function so that we can use an incremental weight adjustment procedure. The squared error for a single input vector, X, evoking an output of f when the desired output is d is:
54
CHAPTER 4. NEURAL NETWORKS
ε = (d − f )2 It is convenient to take the partial derivatives of ε with respect to the various weights in groups corresponding to the weight vectors. We deﬁne a partial (j) derivative of a quantity φ, say, with respect to a weight vector, Wi , thus: ∂φ
(j) ∂Wi (j) def
=
∂φ
(j) ∂w1i
,...,
∂φ
(j) ∂wli (j)
,...,
∂φ
(j) ∂wmj−1 +1,i
where wli is the lth component of Wi . This vector partial derivative of φ is called the gradient of φ with respect to W and is sometimes denoted by W φ. Since ε’s dependence on Wi rule to write:
(j)
is entirely through si , we can use the chain
(j) (j)
(j)
∂ε ∂Wi Because si
(j) (j)
=
∂ε ∂si
(j)
∂si
∂Wi
= X(j−1) •Wi ,
(j)
∂si
(j) (j)
∂Wi
= X(j−1) . Substituting yields: ∂ε ∂si
(j)
∂ε
(j) ∂Wi
=
X(j−1)
Note that
∂ε (j) ∂si
= −2(d − f ) ∂ε
∂f (j) . ∂si
Thus, ∂f ∂si
(j)
(j) ∂Wi
= −2(d − f )
X(j−1)
The quantity (d−f )
(j) δi .
∂f (j) ∂si
plays an important role in our calculations; we shall
(j)
denote it by Each of the δi ’s tells us how sensitive the squared error of the network output is to changes in the input to each threshold function. Since we will be changing weight vectors in directions along their negative gradient, our fundamental rule for weight changes throughout the network will be: Wi
(j) (j)
← Wi
(j)
+ ci δi X(j−1)
(j) (j)
where ci is the learning rate constant for this weight vector. (Usually, the learning rate constants for all weight vectors in the network are the same.) We see that this rule is quite similar to that used in the error correction procedure
4.4. TRAINING FEEDFORWARD NETWORKS BY BACKPROPAGATION55 for a single TLU. A weight vector is changed by the addition of a constant times its vector of (unweighted) inputs. Now, we must turn our attention to the calculation of the δi ’s. Using the deﬁnition, we have: δi
(j) (j)
= (d − f )
∂f ∂si
(j)
We have a problem, however, in attempting to carry out the partial derivatives of f with respect to the s’s. The network output, f , is not continuously diﬀerentiable with respect to the s’s because of the presence of the threshold functions. Most small changes in these sums do not change f at all, and when f does change, it changes abruptly from 1 to 0 or vice versa. A way around this diﬃculty was proposed by Werbos [Werbos, 1974] and (perhaps independently) pursued by several other researchers, for example [Rumelhart, Hinton, & Williams, 1986]. The trick involves replacing all the threshold functions by diﬀerentiable functions called sigmoids.1 The output of a sigmoid function, superimposed on that of a threshold function, is shown 1 in Fig. 4.18. Usually, the sigmoid function used is f (s) = 1+e−s , where s is the input and f is the output.
f (s)
threshold function
sigmoid f (s) = 1/[1 + e s] s
Figure 4.18: A Sigmoid Function
1 [Russell
& Norvig 1995, page 595] attributes the use of this idea to [Bryson & Ho 1969].
... Wi(k1) (k1) w (k) l fi (k1) i si(k1) W(k) f(k) s(k) (k) ..19: A Network with Sigmoid Units 4. m1 sigmoids mj sigmoids m(k1) sigmoids Figure 4... (j) The output of the ith sigmoid unit in the jth layer is denoted by fi ..) −s 1+e i First Layer jth Layer (k1)th Layer kth Layer X(0) X(1) X(j) X(k1) . 4. fi = (j) .4. .19.56 CHAPTER 4. . . fi(1) Wi(j) fi(j) (1) (j) wli(j) i i si(1) si(j) .. Wi(1) .... NEURAL NETWORKS We show the network containing sigmoid units in place of TLUs in Fig. (That (j) 1 is....3 Computing Weight Changes in the Final Layer We ﬁrst calculate δ (k) in order to compute the weight change for the ﬁnal sigmoid unit: ... .
4. The backpropagation weight adjustment for the single element in the ﬁnal layer can be written as: W ←− W + c(d − f )f (1 − f )X Written in the same format. For a pattern far away from this fuzzy hyperplane. and the backpropagation rule makes little or no change to the weight values regardless of the desired output. When f is 0. the weight vector in the ﬁnal layer is changed according to the rule: W(k) ← W(k) + c(k) δ (k) X(k−1) where δ (k) = (d − f (k) )f (k) (1 − f (k) ) It is interesting to compare backpropagation to the errorcorrection rule and to the WidrowHoﬀ rule. f (1 − f ) can vary in value from 0 to 1. (Small changes in the weights will have little eﬀect on the output for inputs far from the hyperplane. Substituting gives us: ∂s δ (k) = (d − f (k) )f (k) (1 − f (k) ) we have Rewriting our general rule for weight vector changes. namely f (s) = that ∂f = f (1 − f ). The sigmoid function can be thought of as implementing a “fuzzy” hyperplane. With the sigmoid function. and these changes are in the direction of correcting the error. . when the input to the sigmoid is 0).) Weight changes are only made within the region of “fuzz” surrounding the hyperplane.4. f (1 − f ) obtains its maximum value of 1/4 when f is 1/2 (that is. f (1 − f ) is also 0. Given the sigmoid function that we are using. TRAINING FEEDFORWARD NETWORKS BY BACKPROPAGATION57 δ (k) = (d − f (k) ) ∂f (k) ∂s(k) 1 1+e−s . just as in the errorcorrection and WidrowHoﬀ rules. f (1 − f ) is 0. the errorcorrection rule is: W ←− W + c(d − f )X and the WidrowHoﬀ rule is: W ←− W + c(d − f )X The only diﬀerence (except for the fact that f is not thresholded in WidrowHoﬀ) is the f (1 − f ) term due to the presence of the sigmoid function. when f is 1. f (1 − f ) has value close to 0.
depends on si through each of the summed inputs to the sigmoids in the (j + 1)th layer. we can similarly compute how to change each of the weight vectors in the network.4. f . since the weights do not depend on the s’s: ∂sl (j+1) (j) ∂ = mj +1 ν=1 fν wνl (j) (j) (j+1) mj +1 = ν=1 ∂si ∂si (j) ∂fν (j) wνl (j) (j+1) ∂fν (j) ∂si (j) ∂fν Now. NEURAL NETWORKS 4. we note that Therefore: ∂si = 0 unless ν = i.58 CHAPTER 4. To do that we ﬁrst write: (j+1) (j+1) = X(j) •Wl mj +1 = ν=1 (j) fν wνl (j+1) And then. So: δi (j) (j) = (d − f ) ∂f ∂si (j) = (d − f ) ∂f ∂s1 (j+1) ∂s1 (j+1) (j) + ··· + ∂f ∂sl (j+1) ∂sl (j+1) (j) + ··· + ∂f (j+1) ∂smj+1 (j) (j+1) ∂si ∂si ∂smj+1 ∂si (j+1) (j+1) ∂sl (j) ∂si mj+1 = l=1 (d − f ) ∂f ∂sl (j+1) ∂sl (j+1) (j) mj+1 = l=1 ∂si δl It remains to compute the sl ∂sl (j+1) (j) ∂si ’s. The ﬁnal output. (j) (j) ∂sl (j+1) (j) ∂si = wil (j+1) (j) fi (1 − fi ) (j) . in which case ∂sν (j) = fν (1 − fν ).4 Computing Changes to the Weights in Intermediate Layers Using our expression for the δ’s. Recall: δi (j) = (d − f ) ∂f ∂si (j) Again we use a chain rule.
it has an intuitively reasonable explanation.4. simulated annealing. 1986. Krogh. et al. 1991]).. namely: Wi (j) ← Wi (j) + ci δi X(j−1) (j) (j) Although this rule appears complex. regularization methods] Simulated Annealing To apply simulated annealing. see [Hertz. approaches either 0 or 1. If we fall early into an errorfunction valley that is not very deep (a local minimum). These calculations can be simply implemented by “backpropagating” the δ’s through the weights in reverse direction (thus. (It is interesting to note that this expression is independent of the error function. the value of the learning rate constant is gradually decreased with time. and soon . the adjustments for the weights going in to a sigmoid unit in the jth layer are proportional to the eﬀect that such adjustments have on that sigmoid unit’s output (its f (j) (1 − f (j) ) factor). TRAINING FEEDFORWARD NETWORKS BY BACKPROPAGATION59 We use this result in our expression for δi (j) δi (j) to give: δl (j+1) mj+1 = (j) fi (1 − (j) fi ) l=1 wil (j+1) The above equation is recursive in the δ’s. This average eﬀect depends on the weights going out of the sigmoid unit in the jth layer (small weights produce little downstream eﬀect) and the eﬀects that changes in the outputs of (j + 1)th layer sigmoid units will have on the ﬁnal output (as measured by the δ (j+1) ’s). we can use this equation to compute the δi ’s. & Palmer. The base case is δ (k) . momemtum (Plaut. They are also proportional to a kind of “average” eﬀect that any change in the output of that sigmoid unit will have on the ﬁnal output.) As the recursion equation for the δ’s shows. quickprop. it typically will neither be very broad. the error function explicitly (j+1) aﬀects only the computation of δ (k) .) Having computed the δi ’s for layer (j) j + 1. (Adjustments diminish as the ﬁnal output. 4. the name backprop for this algorithm).5 Variations on Backprop [To be written: problem of local minima. f . because they have vanishing eﬀect on f then.4. which we have already computed: δ (k) = (d − f (k) )f (k) (1 − f (k) ) We use this expression for the δ’s in our generic weight changing rule.4. The quantity δ (k) = (d − f )f (1 − f ) controls the overall amount and sign of all weight adjustments in the network.
60
CHAPTER 4. NEURAL NETWORKS
a subsequent large correction will jostle us out of it. It is less likely that we will move out of deep valleys, and at the end of the process (with very small values of the learning rate constant), we descend to its deepest point. The process gets its name by analogy with annealing in metallurgy, in which a material’s temperature is gradually decreased allowing its crystalline structure to reach a minimal energy state.
4.4.6
An Application: Steering a Van
A neural network system called ALVINN (Autonomous Land Vehicle in a Neural Network) has been trained to steer a Chevy van successfully on ordinary roads and highways at speeds of 55 mph [Pomerleau, 1991, Pomerleau, 1993]. The input to the network is derived from a lowresolution (30 x 32) television image. The TV camera is mounted on the van and looks at the road straight ahead. This image is sampled and produces a stream of 960dimensional input vectors to the neural network. The network is shown in Fig. 4.20.
sharp left
960 inputs 30 x 32 retina
...
centroid of outputs steers vehicle
straight ahead
...
5 hidden units connected to all 960 inputs sharp right 30 output units connected to all hidden units
Figure 4.20: The ALVINN Network The network has ﬁve hidden units in its ﬁrst layer and 30 output units in the second layer; all are sigmoid units. The output units are arranged in a linear order and control the van’s steering angle. If a unit near the top of the array of output units has a higher output than most of the other units, the van is steered to the left; if a unit near the bottom of the array has a high output, the van is steered to the right. The “centroid” of the responses of all of the output
4.5. SYNERGIES BETWEEN NEURAL NETWORK AND KNOWLEDGEBASED METHODS61 units is computed, and the van’s steering angle is set at a corresponding value between hard left and hard right. The system is trained by a modiﬁed online training regime. A driver drives the van, and his actual steering angles are taken as the correct labels for the corresponding inputs. The network is trained incrementally by backprop to produce the driverspeciﬁed steering angles in response to each visual pattern as it occurs in real time while driving. This simple procedure has been augmented to avoid two potential problems. First, since the driver is usually driving well, the network would never get any experience with farfromcenter vehicle positions and/or incorrect vehicle orientations. Also, on long, straight stretches of road, the network would be trained for a long time only to produce straightahead steering angles; this training would swamp out earlier training to follow a curved road. We wouldn’t want to try to avoid these problems by instructing the driver to drive erratically occasionally, because the system would learn to mimic this erratic behavior. Instead, each original image is shifted and rotated in software to create 14 additional images in which the vehicle appears to be situated diﬀerently relative to the road. Using a model that tells the system what steering angle ought to be used for each of these shifted images, given the driverspeciﬁed steering angle for the original image, the system constructs an additional 14 labeled training patterns to add to those encountered during ordinary driver training.
4.5
Synergies Between Neural Network and KnowledgeBased Methods Bibliographical and Historical Remarks
To be written; discuss rulegenerating procedures (such as [Towell & Shavlik, 1992]) and how expertprovided rules can aid neural net training and viceversa [Towell, Shavlik, & Noordweier, 1990]. To be added.
4.6
62
CHAPTER 4. NEURAL NETWORKS
X. suppose we have the two probability distributions (perhaps probability density functions). p(X  1) and p(X  2).1. if we decide category i. (The treatment given here can easily be generalized to Rcategory problems. j = 1.1 Using Statistical Decision Theory Background and General Method Suppose the pattern vector. its category is j. Given a pattern. is a random variable whose probability distribution for category 1 is diﬀerent than it is for category 2. we want to use statistical techniques to determine its category—that is. 2. Our decision rule will be to decide that X belongs to category 1 if LX (1) ≤ LX (2). For any given pattern.Chapter 5 Statistical Learning 5. for i. These techniques are based on the idea of minimizing the expected value of a quantity similar to the error function we used in deriving the weightchanging rules for backprop. (We might decide that a pattern really in category 1 is in category 2. In developing a decision method.1 5. Given a pattern. the expected value of the loss will be: LX (i) = λ(i  1)p(1  X) + λ(i  2)p(2  X) where p(j  X) is the probability that given a pattern X. it is necessary to know the relative seriousness of the two kinds of mistakes that might be made. we want to decide its category in such a way that minimizes the expected value of this loss.) We describe this information by a loss function. and to decide on category 2 otherwise. X. to determine from which distribution it was drawn. X. 63 . λ(i  j) represents the loss incurred when we decide a pattern is in category i when really it is in category j. and vice versa. λ(i  j). We assume here that λ(1  1) and λ(2  2) are both 0.) Speciﬁcally. X.
then our decision rule is simply. this simple decision rule implements what is called a maximumlikelihood decision. and p(X) is the (a priori) probability of pattern X being the pattern we are asked to classify. which we assume to be known (or estimatible): p(j  X) = p(X  j)p(j) p(X) where p(j) is the (a priori) probability of category j (one category may be much more probable than the other). Decide category 1 iﬀ: λ(1  2)p(X  2)p(2) ≤ λ(2  1)p(X  1)p(1) If λ(1  2) = λ(2  1) and if p(1) = p(2). if we deﬁne k(i  j) as λ(i  j)p(j). then the decision becomes particularly simple: Decide category 1 iﬀ: p(X  2) ≤ p(X  1) Since p(X  j) is called the likelihood of j with respect to X. More generally. we need to compare the (perhaps weighted) quantities p(X  i) for i = 1 and 2. Performing the substitutions given by Bayes’ Rule. The exact decision rule depends on the the probability distributions assumed. and noticing that p(X) is common to both expressions. we obtain. our decision rule becomes: Decide category 1 iﬀ: λ(1  1) p(X  2)p(2) p(X  1)p(1) + λ(1  2) p(X) p(X) p(X  1)p(1) p(X  2)p(2) + λ(2  2) p(X) p(X) ≤ λ(2  1) Using the fact that λ(i  i) = 0. Decide category1 iﬀ: k(1  2)p(X  2) ≤ k(2  1)p(X  1) In any case. . We will treat two interesting distributions. STATISTICAL LEARNING We can use Bayes’ Rule to get expressions for p(j  X) in terms of p(X  j).64 CHAPTER 5.
1. 5. Σ. one class tends to have patterns clustered around one point in the ndimensional space. its value is always positive). (In that ﬁgure.1. and Σ is the determinant of the covariance matrix.) What decision rule should we use to separate patterns into the two appropriate categories? Substituting the Gaussian distributions into our maximum likelihood formula yields: . In general the principal axes are given by the eigenvectors of Σ. . We show a twodimensional instance of this problem in Fig.2. .1. is the expected value of X (using this distribution).2 Gaussian (or Normal) Distributions The multivariate (ndimensional) Gaussian distribution is given by the probability density function: p(X) = e (2π)n/2 Σ1/2 1 −1 −(X−M)t Σ (X−M) 2 where n is the dimension of the column vector X. the equiprobability contours are all centered on the mean vector. (X − M)t is the transpose of the vector (X − M). an intuitive idea for Gaussian distributions can be given when n = 2.5. The mean vector. the covariance matrix. A threedimensional plot of the distribution is shown at the top of the ﬁgure. Suppose we now assume that the two classes of pattern vectors that we want to distinguish are each distributed according to a Gaussian distribution but with diﬀerent means and covariance matrices. Σ is the covariance matrix of the distribution (an n × n symmetric. M. Although the formula appears complex. positive deﬁnite matrix). mn ). The components of the covariance matrix are given by: 2 σij = E[(xi − mi )(xj − mj )] 2 In particular. We show a twodimensional Gaussian distribution in Fig. In any case. the column vector M is called the mean vector. thus equiprobability contours are hyperellipsoids in ndimensional space. is such that the elliptical contours of equal probability are skewed. USING STATISTICAL DECISION THEORY 65 5. that is if all oﬀdiagonal terms were 0. 5. σii is called the variance of xi . the formula in the exponent in the Gaussian distribution is a positive deﬁnite quadratic form (that is. that is. That is. which in our ﬁgure happens to be at the origin. . In this case. Σ−1 is the inverse of the covariance matrix. . we have plotted the sum of the two distributions. and the other class tends to have patterns clustered around another point. then the major axes of the elliptical contours would be aligned with the coordinate axes. If the covariance matrix were diagonal. with components (m1 . M. and contours of equal probability are shown at the bottom. In general. M = E[X].
After some manipulation.5 0. The result of the comparison isn’t changed if we compare logarithms instead.5 .75 75 0. STATISTICAL LEARNING p(x . our decision rule is then: . and the category 2 patterns are distributed with mean and covariance M2 and Σ2 .1: The TwoDimensional Gaussian Distribution Decide category 1 iﬀ: 1 (2π)n/2 Σ is less than or equal to 2 1/2 e−1/2(X−M2 ) t Σ2 (X−M2 ) −1 t −1 1 e−1/2(X−M1 ) Σ1 (X−M1 ) (2π)n/2 Σ1 1/2 where the category 1 patterns are distributed with mean and covariance M1 and Σ1 .66 CHAPTER 5. respectively.25 25 0 5 0 5 5 5 0 x 1 x 2 6 4 2 0 2 4 6 6 4 2 6 4 2 x 2 4 6 1 0 2 4 6 Figure 5.x ) 1 2 x 2 1 0.
5 5 5 2.5 0.5 5 2. a constant bias term. the decision rule involves a quadric surface (a hyperquadric) in ndimensional space. etc. then the contours of equal probability for each of the two distributions .75 75 0.5 . It is interesting to look at a special case of this surface.2: The Sum of Two Gaussian Distributions Decide category 1 iﬀ: (X − M1 )t Σ−1 (X − M1 ) < (X − M2 )t Σ−1 (X − M2 ) + B 1 2 where B.x ) 1 2 x 2 1 0. When the quadratic forms are multiplied out and represented in terms of the components xi .5 0 2.5.25 25 0 5 0 5 10 5 0 10 5 x 1 10 7. with all σii equal to each other. The exact shape and position of this hyperquadric is determined by the means and the covariance matrices.5 10 Figure 5. The surface separates the space into two parts. USING STATISTICAL DECISION THEORY 67 p(x . one of which contains points that will be assigned to category 1 and the other contains points that will be assigned to category 2.5 5 7.5 0 2. incorporates the logarithms of the fractions preceding the exponential.1. If the covariance matrices for each category are identical and diagonal.
if there are suﬃcient training patterns. there are various techniques for estimating them.3 Conditionally Independent Binary Components Suppose the vector X is a random variable having binary (0. and the decision rule is: Decide category 1 iﬀ: (X − M1 )t (X − M1 ) < (X − M2 )t (X − M2 ) Multiplying out yields: X•X − 2X•M1 + M1 •M1 < X•X − 2X•M2 + M2 •M2 or ﬁnally. In fact. one can use sample means and sample covariance matrices. We see that the optimal decision surface in this special case is a hyperplane.1. For example. If the parameters (Mi . We continue to denote the two probability distributions by p(X  1) and p(X  2). Decide category 1 iﬀ: X•M1 ≥ X•M2 + Constant or X•(M1 − M2 ) ≥ Constant where the constant depends on the lengths of the mean vectors.1) components. if the number of training patterns is less than n.) 5. STATISTICAL LEARNING are hyperspherical. The quadric forms then become (1/Σ)(X − Mi )t (X − Mi ). for example. and then using those estimates in the decision rule. Σi ) of the probability distributions of the categories are not known. The weights in a TLU implementation are equal to the diﬀerence in the mean vectors. we mean that the formulas for the distribution can be expanded as follows: . the hyperplane is perpendicular to the line joining the two means. (Caution: the sample covariance matrix will be singular if the training patterns happen to lie on a subspace of the whole ndimensional space—as they certainly will.68 CHAPTER 5. By conditional independence in this case. Further suppose that the components of these vectors are conditionally independent given the category.
5. we note that since xi can only assume the values of 1 or 0: log p(xi  1) pi (1 − pi ) = xi log + (1 − xi ) log p(xi  2) qi (1 − qi ) . xi : p(xi = 1  1) = pi p(xi = 0  1) = 1 − pi p(xi = 1  2) = qi p(xi = 0  2) = 1 − qi Now. we obtain. Decide category 1 iﬀ: p(1)p(x1  1)p(x2  1) · · · p(xn  1) ≥ p(x1  2)p(x2  2) · · · p(xn  2)p(2) or iﬀ: p(2) p(x1  1)p(x2  1) . 2 Recall the minimumaverageloss decision rule. . . Decide category 1 iﬀ: λ(1  2)p(X  2)p(2) ≤ λ(2  1)p(X  1)p(1) Assuming conditional independence of the components and that λ(1  2) = λ(2  1).1. USING STATISTICAL DECISION THEORY 69 p(X  i) = p(x1  i)p(x2  i) · · · p(xn  i) for i = 1. p(xn  2) p(1) or iﬀ: log p(x1  1) p(x2  1) p(xn  1) p(1) + log + · · · + log + log ≥0 p(x1  2) p(x2  2) p(xn  2) p(2) Let us deﬁne values of the components of the distribution for speciﬁc values of their arguments. . . p(xn  1) ≥ p(x1  2)p(x2  2) .
a knearestneighbor method assigns a new pattern. . n. qi and p(1). . These are called nearestneighbor methods or. Of course the denser are the points around X. More precisely.2 To be added. STATISTICAL LEARNING = xi log (1 − pi ) pi (1 − qi ) + log qi (1 − pi ) (1 − qi ) Substituting these expressions into our decision rule yields: Decide category 1 iﬀ: n xi log i=1 pi (1 − qi ) p(1) (1 − pi ) + + log ≥0 log qi (1 − pi ) i=1 (1 − qi ) p(2) n We see that we can achieve this decision with a TLU with weight values as follows: wi = log for i = 1. a nearestneighbor procedure decides that some new pattern. 1991]. X. we can use a sample of labeled training patterns to estimate these parameters. 5. sometimes. But large values of k also reduce the acuity of the method.70 CHAPTER 5. belongs to the same category as do its closest neighbors in Ξ. Using relatively large values of k decreases the chance that the decision will be unduly inﬂuenced by a noisy training pattern close to X. (A collection of papers on this subject is in [Dasarathy.3 Another class of methods can be related to the statistical ones. . . The knearestneighbor method can be thought of as estimating the values of the probabilities of the classes given X. and the larger the value of k. to that category to which the plurality of its k closest neighbors belong. X.) Given a training set Ξ of m labeled patterns. the better the estimate. . and wn+1 = log p(1) (1 − pi ) + log 1 − p(1) i=1 (1 − qi ) n pi (1 − qi ) qi (1 − pi ) If we do not know the pi . Learning Belief Networks NearestNeighbor Methods 5. memorybased methods.
Moore. the method and its derivatives have seen several practical applications. Since memory cost is now reasonably low. Also. . 1992. n class of training pattern 3 3 3 2 2 3 3 1 training pattern 3 2 1 1 2 2 2 2 3 3 3 2 2 1 1 1 1 1 1 X (a pattern to be classified) k=8 four patterns of category 1 two patterns of category 2 two patterns of category 3 plurality are in category 1. The CoverHart theorem states that under very mild conditions (having to do with the smoothness See [Baum. where p(i) is the a priori probability of category i. 1994] for theoretical analysis of error rate as a function of the number of training patterns for the case in which points are randomly distributed on the surface of a unit sphere and underlying function is linearly separable. where aj is the scale factor for dimension j. NEARESTNEIGHBOR METHODS 71 The distance metric used in nearestneighbor methods (for numerical attributes) can be simple Euclidean distance. . This distance measure is often modiﬁed by scaling the features so that the spread of attribute values along each dimension is approximately the same. 1977]. and p(X  i) is the probability (or probability density function) of X given that X belongs to category i. x2n ) is j=1 (x1j − x2j ) .3. . [Moore. the distance between two 2 patterns (x11 .3: An 8NearestNeighbor Decision Nearestneighbor methods are memory intensive because a large number of training patterns must be stored to achieve good generalization. x22 . x12 . .3. . That is.. the distance calculations required to ﬁnd nearest neighbors can often be eﬃciently computed by kdtree methods [Friedman.5. . . 1994]. An example of a nearestneighbor decision problem is shown in Fig. n 2 2 the distance between the two vectors would be j=1 aj (x1j − x2j ) . so decide X is in category 1 Figure 5. for categories i = 1. 1967] relates the performance of the 1nearestneighbor method to the performance of a minimumprobabilityoferror classiﬁer. 5. As mentioned earlier. for example. . R. et al. . . In the ﬁgure the class of a training pattern is indicated by the number next to it. Suppose the probability of error in classifying patterns of such a minimumprobabilityoferror classiﬁer is ε. (See. A theorem by Cover and Hart [Cover & Hart. In that case. . . . et al. the minimumprobabilityoferror classiﬁer would assign a new pattern X to that category that maximized p(i)p(X  i).. x1n ) and (x21 .
5. of a 1nearestneighbor classiﬁer is bounded by: ε ≤ εnn ≤ ε 2 − ε Also see [Aha.72 CHAPTER 5. STATISTICAL LEARNING of probability density functions) the probability of error. εnn . Bibliographical and Historical Remarks . R R−1 ≤ 2ε where R is the number of categories.4 To be added. 1991].
In Quinlan’s terminology. We show an example in Fig. A decision tree assigns a class number (or output) to an input pattern by ﬁltering the pattern down through the tests in the tree. and the rightmost one assigns the pattern to class 1. but will use the words “category” and “class” interchangeably to refer to what Quinlan calls “class. test T2 in the tree of Fig. (If all of the tests have two outcomes.Chapter 6 Decision Trees 6. The tests might have two outcomes or more than two.1 has three outcomes. The tests might be multivariate (testing on several features of the input at once) or univariate (testing on only one of the features). Each test has mutually exclusive and exhaustive outcomes. (Binaryvalued ones can be regarded as either. We follow the usual convention of depicting the leaf nodes by the class number. There are several dimensions along which decision trees might diﬀer: a.) c.1. Quinlan distinguishes between classes and categories. categorically valued functions. 6. We will not make this distinction. the middle one sends the input pattern down to test T4 .1 Note that in discussing decision trees we are not limited to implementing Boolean functions—they are useful for general. the leftmost one assigns the input pattern to class 3. however. He calls the subsets of patterns that ﬁlter down to each tip categories and subsets of patterns having the same label classes.) 1 One of the researchers who has done a lot of work on learning decision trees is Ross Quinlan. The features or attributes might be categorical or numeric.1 Deﬁnitions A decision tree (generally deﬁned) is a tree whose internal nodes are tests (on input patterns) and whose leaf nodes are categories (of patterns). For example. we have a binary decision tree. 6. b. our example tree has nine categories and three classes.” 73 .
3.5 [Quinlan. We show an example in Fig. we branch left. The DNF form implemented by such a tree can be obtained by tracing down each path leading to a tip node corresponding to an output value of 1. although incremental ones have also been proposed [Utgoﬀ. forming the conjunction of the tests along this path. Prominent among these are ID3 and its new version. C4. the tree implements a Boolean function. The kDL class of Boolean functions can be implemented by a multivariate decision tree having the (highly unbalanced) form shown in Fig.1: A Decision Tree d.2 Supervised Learning of Univariate Decision Trees Several systems for learning decision trees have been proposed. 1993]. each nonleaf node is depicted by a single attribute. and is called a Boolean decision tree. 6. Quinlan.2. The vi all have values of 0 or 1. We might have two classes or more than two. 1989]. 6. It is straightforward to represent the function implemented by a univariate Boolean decision tree in DNF form.. In drawing univariate decision trees.74 CHAPTER 6. 1986. is a term of size k or less. and CART [Breiman. and then taking the disjunction of these conjunctions. 6. 1984] We discuss here only batch methods. DECISION TREES T1 T2 T4 T3 3 1 T4 1 2 3 T4 2 3 2 1 Figure 6. Each test. ci . if it has value 1. If the attribute has value 0 in the input pattern. If we have two classes and binary inputs. . et al. we branch right.
1 Selecting the Type of Test As usual. but nonbinary. the tests might be formed by dividing the attribute values into mutually exclusive and exhaustive subsets. A decision tree with such tests is shown in Fig. If the attributes are numeric. 6. 6. The entropy or uncertainty still remaining about the class of a pattern— knowing that it is in some set.2. If the attributes are categorical. SUPERVISED LEARNING OF UNIVARIATE DECISION TREES 75 x3 0 x2 0 0 x3x2 1 0 x3x4 1 x3x2 1 x3 x4 x1 f = x3x2 + x3x4x1 0 x1 1 0 x3x4x1 1 x4 0 1 Figure 6. we have n features or attributes. of patterns is deﬁned as: H(Ξ) = − i p(iΞ) log2 p(iΞ) . we must also decide what the tests should be (besides selecting the order).” for example 7 ≤ xi ≤ 13.2.2: A Decision Tree Implementing a DNF Function 6. the most popular one is at each stage to select that test that maximally reduces an entropylike measure.4.2. the tests are simply whether the attribute’s value is 0 or 1.6. Extension to multipleoutcome tests is straightforward computationally but gives poor results because entropy is always decreased by having more outcomes. Ξ. For categorical and numeric attributes. Several techniques have been tried. If the attributes are binary.2. the tests might involve “interval tests.2 Using Uncertainty Reduction to Select Tests The main problem in learning decision trees for the binaryattribute case is selecting the order of the tests. We show how this technique works for the simple case of tests with binary outcomes.
we estimate them by sample statistics. Ξ1 . We want to select tests at each node such that as we travel down the decision tree. . If we perform a test. from now on we’ll drop the “hats” and use sample statistics as if they were real probabilities. k. T . Let p(iΞ) be the number of patterns ˆ in Ξ belonging to class i divided by the total number of patterns in Ξ.) If we knew that T applied to a pattern in Ξ resulted in the jth outcome (that is.76 CHAPTER 6. .. Ξk . . . they are nevertheless useful in estimating uncertainties. Suppose that ni of the patterns in Ξ are in Ξi for i = 1. . we knew that the pattern was in Ξj ). Then an estimate of the uncertainty is: ˆ H(Ξ) = − i p(iΞ) log2 p(iΞ) ˆ ˆ For simplicity.3: A Decision Tree Implementing a Decision List where p(iΞ) is the probability that a pattern drawn at random from Ξ belongs to class i. Although these estimates might be errorful. Ξ2 . having k possible outcomes on the patterns in Ξ... DECISION TREES cq cq1 vn ci vn1 vi 1 v1 Figure 6. we will create k subsets. the uncertainty about the class of a pattern becomes less and less. and the summation is over all of the classes. (Some ni may be 0. Since we do not in general have the probabilities p(iΞ). the uncertainty about its class would be: H(Ξj ) = − i p(iΞj ) log2 p(iΞj ) and the reduction in uncertainty (beyond knowing only that the pattern was in Ξ) would be: .
or d {a. SUPERVISED LEARNING OF UNIVARIATE DECISION TREES x3 = a. Again. The estimate p(Ξj ) of p(Ξj ) is just the number of those ˆ patterns in Ξ that have outcome j divided by the total number of patterns in Ξ. c. f} 2 77 x1 = e. we don’t know the probabilities p(Ξj ). and the sum is taken from 1 to k.6. c} {b} {d} 2 x4 = a.b} 1 1 x2 = a. The uncertainty calculations are particularly simple when the tests have binary outcomes and when the attributes have . and then applies this criterion recursively until some termination condition is met (which we shall discuss in more detail later). The average reduction in uncertainty achieved by test T (applied to patterns in Ξ) is then: RT (Ξ) = H(Ξ) − E[HT (Ξ)] An important family of decision tree learning algorithms selects for the root of the tree that test that gives maximum reduction of uncertainty. or d {e. b.2. by: E[HT (Ξ)] = j p(Ξj )H(Ξj ) where by HT (Ξ) we mean the average uncertainty after performing test T on the patterns in Ξ. p(Ξj ) is the probability that the test has outcome j. g} 2 Figure 6. or g {a} {g} 1 {d} {a. But we can estimate the average uncertainty over all the Ξj . f. or g {e. b. e.4: A Decision Tree with Categorical Attributes H(Ξ) − H(Ξj ) Of course we cannot say that the test T is guaranteed always to produce that amount of reduction in uncertainty because we don’t know that the result of the test will be the jth outcome. but we can use sample values.
0) (1. 0. 1) (1. should be performed ﬁrst? The illustration in Fig. 0. We’ll give a simple example to illustrate how the test selection mechanism works in that case. 1) (0. 0) (0. x2 . we calculate the uncertainty reduction if we perform x1 ﬁrst. x1 . or x3 . 0) (1. 1. 0.5 gives geometric intuition about the problem. 1) class 0 0 0 0 0 1 0 1 What single test. 0) (0. 1. containing all eight points is: H(Ξ) = −(6/8) log2 (6/8) − (2/8) log2 (2/8) = 0. The lefthand branch has only patterns belonging to class 0 (we call them the set Ξl ). So. 1) (1. 0.5: Eight Patterns to be Classiﬁed by a Decision Tree The initial uncertainty for the set. DECISION TREES binary values. and the righthandbranch (Ξr ) has two patterns in each class. 1.78 CHAPTER 6. Ξ.81 Next. Suppose we want to use the uncertaintyreduction method to build a decision tree to classify the following patterns: pattern (0. 6. x3 x2 The test x1 x1 Figure 6. 1. the uncertainty of the lefthand branch is: .
The decision tree that this procedure creates thus implements the Boolean function: f = x1 x3 . we can measure the uncertainty reduction for each test that is achieved by every possible threshold (there are only a ﬁnite number of thresholds that give diﬀerent test results if there are only a ﬁnite number of training patterns). we can still use the uncertaintyreduction technique to select tests. 1986.3 NonBinary Attributes If the attributes are nonbinary. We can calculate the uncertainty reduction for each split.5 Therefore the uncertainty reduction on Ξ achieved by x1 is: Rx1 (Ξ) = 0. our “greedy” algorithm for selecting a ﬁrst test would select either x1 or x3 . sect. we must select a test on that attribute.6. Suppose for example that the value of an attribute is a real number and that the test to be performed is to set a threshold and to test to see if the number is greater than or less than that threshold. Thus. Thus.6. The uncertaintyreduction procedure would select x3 as the next test.5 = 0. But now.3 Networks Equivalent to Decision Trees Since univariate Boolean decision trees are implementations of DNF functions. 6. 6. they are also equivalent to twolayer. See [Quinlan.31 By similar calculations. feedforward neural networks. if an attribute is categorical (with a ﬁnite number of categories). 4] for another example. The decision tree at the left of the ﬁgure implements .2. in addition to selecting an attribute. given a set of labeled patterns. We show an example in Fig. In principle. NETWORKS EQUIVALENT TO DECISION TREES 79 Hx1 (Ξl ) = −(4/4) log2 (4/4) − (0/4) log2 (0/4) = 0 And the uncertainty of the righthand branch is: Hx1 (Ξr ) = −(2/4) log2 (2/4) − (2/4) log2 (2/4) = 1 Half of the patterns “go left” and half “go right” on test x1 . we see that the test x3 achieves exactly the same uncertainty reduction. Similarly. 6. but x2 achieves no reduction whatsoever. Suppose x1 is selected.81 − 0. there are only a ﬁnite number of mutually exclusive and exhaustive subsets into which the values of the attribute can be split.3. the average uncertainty after performing the x1 test is: 1/2Hx1 (Ξl ) + 1/2Hx1 (Ξr ) = 0.
and (for a special case) by [Marchand & Golea. it is possible to reduce the subset of functions that are consistent with the training set suﬃciently to make useful guesses about the value of the function for inputs not in the training set. but the weights in the ﬁrst two layers must be trained. even with an incomplete set of training samples.80 CHAPTER 6. then. Of course. are indicated by L1 . 6. When we know a priori that the function we are trying to guess belongs to a small subset of all possible functions. all of the features are evaluated in parallel for any input pattern. whereas when implemented as a decision tree only those features on the branch traveled down by the input pattern need to be evaluated. L3 .5 terms disjunction 0 x3x4x1 x3x4x1 x3x2 Figure 6. each implemented by a TLU. f = x3x2 + x3x4x1 1 x1 x3 0 x2 0 0 x3x2 1 x3x2 1 0 x3x4 x3x4x1 1 +1 +1 1. by [John. We show an example in Fig. we must choose a function to ﬁt the training set from among a set of hypotheses.7 in which the linearly separable functions. 1995]. Again. and L4 .4 6. Different approaches to training procedures have been discussed by [Brent.1 Overﬁtting and Evaluation Overﬁtting In supervised learning. L2 . DECISION TREES the same function as the network at the right of the ﬁgure. 1993].6: A Univariate Decision Tree and its Equivalent Network Multivariate decision trees with linearly separable functions at each node can also be implemented by feedforward networks—in this case threelayer ones. when implemented as a network.5 x2 0 x4 1 x1 0 1 1 X x3 x4 f +1 1 0. 6. And.4. We have already showed that generalization is impossible without bias. 1990]. The decisiontree induction methods discussed in this chapter can thus be thought of as particular ways to establish the structure and the weight values for networks. the ﬁnal layer has ﬁxed weights. .
if we are comparing several learning systems (for example.” True. with a consequent expected improvement in generalization. Since a decision tree of suﬃcient size can implement any Boolean function there is a danger of overﬁtting—especially if the training set is small. However. They make use of methods for estimating how well a given decision tree might generalize—methods we shall describe next. and generalization performance will be poor. it might perform poorly on new patterns that were not used to build the decision tree. we say that we are overﬁtting the training data.7: A Multivariate Decision Tree and its Equivalent Network the larger the training set. training on the test data enlarges the training set. OVERFITTING AND EVALUATION 81 L1 0 L2 0 0 1 1 0 1 0 L4 1 1 L3 1 L1L2 L1 L2 X L3 L4 + + + disjunction 0 f 0 L1 L3 L4 conjunctions Figure 6. Several techniques have been proposed to avoid overﬁtting. if the training set is not suﬃciently large compared with the size of the hypothesis space. but there is still the danger of overﬁtting if we are comparing several diﬀerent learning systems. When there are too many hypotheses that are consistent with the training set. if we are comparing diﬀerent decision trees) so that we can select the one that performs the best on the test set.2 Validation Methods The most straightforward way to estimate how well a hypothesized function (such as a decision tree) performs on a test set is to test it on the test set! But. the more likely it is that even a randomly selected consistent function will have appropriate outputs for patterns not yet seen. That is.6. even with bias. Overﬁtting is a problem that we must address for all learning methods. and we shall examine some of them here.4. Another technique is to . then such a comparison amounts to “training on the test data.4. even if the decision tree is synthesized to classify all the members of the training set correctly. there will still be too many consistent functions for us to make useful guesses. 6.
For these nodes.8. Describe “bootstrapping” also [Efron. This procedure will result in a few errors but often accepting a small number of errors on the training set results in fewer errors on a testing set. (The error rate is the number of classiﬁcation errors made on Ξi divided by the number of patterns in Ξi . but we can decide in favor of the most numerous class.3 Avoiding Overﬁtting in Decision Trees Near the tips of a decision tree there may be only a few patterns per node. This problem can be dealt with by terminating the testgenerating procedure before all patterns are perfectly split into their separate categories. 1982]. on Ξi . train on the union of all of the other subsets. though. we are selecting a test based on a very small sample. εi . 6. DECISION TREES split the training set—using (say) twothirds for training and the other third for estimating generalization performance. That is. and each Ξi consists of a single pattern. Leaveoneout Validation Leaveoneout validation is the same as cross validation for the special case in which K equals the number of patterns in Ξ. more expensive computationally. .) An estimate of the error rate that can be expected on new patterns of a classiﬁer trained on all the patterns in Ξ is then the average of the εi . Ξi . 6. We count the total number of mistakes and divide by K to get the estimated error rate. CrossValidation In crossvalidation. But splitting reduces the size of the training set and thereby increases the possibility of overﬁtting. we simply note whether or not a mistake was made. This behavior is illustrated in Fig. This type of validation is. . we divide the training set Ξ into K mutually exclusive and exhaustive equalsized subsets: Ξ1 . . One can use crossvalidation techniques to determine when to stop splitting nodes. because underﬁtting usually leads to more errors on test sets than does overﬁtting. If the cross validation error increases as a consequence of a node split. a leaf node may contain patterns of more than one class. There is a general rule that the lowest errorrate attainable by a subtree of a fully expanded tree can be no less than 1/2 of the error rate of the fully expanded tree [Weiss & Kulikowski. . ΞK . and empirically determine the error rate. When testing on each Ξi . For each subset. page 126].82 CHAPTER 6. 1991. then don’t split. of course. and thus we are likely to be overﬁtting. but useful when a more accurate estimate of the error rate for a classiﬁer is needed. We next describe some validation techniques that attempt to avoid these problems. One has to be careful about when to stop. .4.
Or. Assuming equally probable classes. This technique is called postpruning. In between these extremes. 1991) Figure 6. each labeled by one of R classes. If there are m patterns.4 0. C. OVERFITTING AND EVALUATION 83 Error Rate 1. S. one could transmit a list of m Rvalued numbers. Consider the problem of transmitting just the labels of a training set of patterns.8: Determining When Overﬁtting Begins Rather than stopping the growth of a decision tree. 6. the number of bits (or description length of the binary encoded message) is t + d. we might transmit a tree plus a list of labels of all the patterns that the tree misclassiﬁes. it might be more economical to transmit the tree than to transmit the labels directly. 1978]. one might grow it to its full size and then prune away leaf nodes and their ancestors until crossvalidation accuracy no longer increases..4. this transmission would require m log2 R bits. Morgan Kaufmann. assuming that the receiver of this information already has the ordered set of patterns. 1991]. (MDL is an important idea that extends beyond decisiontree methods [Rissanen. Computer Systems that Learn. one could transmit a decision tree that correctly labelled all of the patterns. In general.6 0.) The idea is that the simplest decision tree that can predict the classes of the training patterns is the best one.6.2 0 0 1 2 3 4 5 6 validation errors 7 8 9 Number of Terminal Nodes (From Weiss.8 Iris Data Decision Tree training errors 0. Various techniques for pruning are discussed in [Weiss & Kulikowski.. and d is the length of the message required to transmit the labels of . If the tree is small and accurately classiﬁes all of the patterns.4 MinimumDescription Length Methods An important treegrowing and pruning technique is based on the minimumdescriptionlength (MDL) principle. The number of bits that this transmission would require depends on the technique for encoding decision trees and on the size of the tree. where t is the length of the message required to transmit the tree. and Kulikowski.0 0.4.
Quinlan and Rivest [Quinlan & Rivest. 1992. 6. although they are quite sensitive to the coding schemes used. The DNFform equivalent to the function implemented by this decision tree is f = x1 x2 + x1 x2 x3 x4 + x1 x3 x4 . Kohavi. Refusal to tolerate errors on the training set when there is noise leads to the problem of “ﬁtting the noise. 1994]. A decision graph that implements the same decisions as that of the decision tree of Fig. they prune away those nodes (starting at the tips) that achieve a decrease in the description length. The need for replication means that it takes longer to learn the tree and that subtrees replicated further down the tree must be learned using a smaller training subset. they use the reduction in description length to select tests (instead of reduction in uncertainty).5 Noise in Data Noise in the data means that one must inevitably accept some number of errors—depending on the noise level. In growing a tree. the function f = x1 x2 + x3 x4 . Several approaches might be suggested for dealing with fragmentation. & Wallace. 1989] have proposed techniques for encoding decision trees and lists of exception labels and for calculating the description length (t+d) of these trees and labels. A decision tree for this function is shown in Fig. The MDL method is one way of adhering to the Occam’s razor principle.4.10. 6. 6. then.” Dealing with noise. 6. Another approach is to use multivariate (rather than univariate tests at each node). These techniques compare favorably with the uncertaintyreduction method. One is to attempt to build a decision graph instead of a tree [Oliver. In pruning a tree after it has been grown to zero error. 6.5 The Problem of Replicated Subtrees Decision trees are not the most economical means of implementing some Boolean functions. b. In our example of learning f = x1 x2 + x3 x4 . requires accepting some errors at the leaf nodes just as does the fact that there are a small number of patterns at leaf nodes. that tree associated with the smallest value of t + d is the best or most economical tree. This problem is sometimes called the fragmentation problem. This DNF form is nonminimal (in the number of disjunctions) and is equivalent to f = x1 x2 + x3 x4 . Notice the replicated subtrees shown circled. Consider.9 is shown in Fig.84 CHAPTER 6. DECISION TREES the patterns misclassiﬁed by the tree. Dowe. They then use the description length as a measure of quality of a tree in two ways: a.9. In a sense. if we had a test for x1 x2 . for example.
11.” A third method for dealing with the replicated subtree problem involves extracting propositional “rules” from the decision tree. The rules will have as antecedents the conjunctions that lead down to the leaf nodes. and care must be taken that the active rules do not conﬂict in their decision about the class of a pattern. After a rule set is processed. Quinlan [Quinlan.9: A Decision Tree with Subtree Replication and a test for x3 x4 . A conjunct or rule is determined to be unnecessary if its elimination has little eﬀect on classiﬁcation accuracy—as determined by a chisquare test. it might be the case that more than one rule is “active” for any given pattern. 6. the decision tree could be much simpliﬁed. 1987] discusses methods for reducing a set of rules to a simpler set by 1) eliminating from the antecedent of each rule any “unnecessary” conjuncts. 1995] gives a nice overview (with several citations) of learning such linear discriminant trees and presents a method based on “soft entropy. and then 2) eliminating “unnecessary” rules. .6. and as consequents the name of the class at the corresponding leaf node. [John. Several researchers have proposed techniques for learning decision trees in which the tests at each node are linearly separable functions.6. An example rule from the tree with the repeating subtree of our example would be: x1 ∧ ¬x2 ∧ x3 ∧ x4 ⊃ 1. as shown in Fig. for example. THE PROBLEM OF MISSING ATTRIBUTES 85 x1 x3 x2 0 x3 x4 0 0 1 1 x4 0 1 Figure 6.
and nearestneighbor classiﬁers on a wide variety of problems. 1991.10: A Decision Graph 6. Mooney. et al. 1990. [Taylor. for example. & Spiegalhalter.7 The Problem of Missing Attributes To be added.86 CHAPTER 6. see [Dietterich. Quinlan.6 6. Shavlik. There seems x1x2 x3x4 1 0 1 Figure 6. DECISION TREES x1 x2 x3 1 0 x4 0 1 Figure 6. & Towell. neuralnet. Comparisons Several experimenters have compared decisiontree. Michie. 1994] give thorough comparisons of several machine learning algorithms on several diﬀerent types of problems.11: A Multivariate Decision Tree . 1994]. For a comparison of neural nets versus decision trees. In their StatLog project..
6. on the other. 6. although [Quinlan.8.8 Bibliographical and Historical Remarks To be added. . on the one hand. 1994] does provide some intuition about properties of problems that might render them ill suited for decision trees. there do not seem to be any general conclusions that would enable one to say which classiﬁer method is best for which sorts of classiﬁcation problems. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 87 to be no single type of classiﬁer that is best for all problems. or backpropagation. And.
DECISION TREES .88 CHAPTER 6.
As with any learning problem. Of course. we consider the matter of learning logic programs given a set of variable values for which the logic program should return T (the positive instances) and a set of variable values for which it should return F (the negative instances). the representation most important in computer science is a computer program.True(x). and neural networks. we have seen (Boolean) algebraic expressions. In this chapter. ¬ True(x) We follow Prolog syntax (see.) Programs will be written in “typewriter” font. for example. The unary function “True” returns T if and only if the value of its argument is T . 1994]. this one can be quite c z complex and intractably diﬃcult unless we constrain it with biases of some sort. Similarly. except that our convention is to write variables as strings beginning with lowercase letters and predicates as strings beginning with uppercase letters. ¬ True(y) :. 89 . [Mueller & Page.Chapter 7 Inductive Logic Programming There are many diﬀerent representational forms for functions of input variables. (We now think of Boolean functions and arguments as having values of T and F instead of 0 and 1. the Boolean exclusiveor (odd parity) function of two variables can be computed by the following logic program: Parity(x. For example. 1988]). The subspecialty of machine learning that deals with learning logic programs is called inductive logic programming (ILP) [Lavraˇ & Dˇeroski. decision trees.True(y). a Lisp predicate of binaryvalued inputs computes a Boolean function of those inputs. plus other computational mechanisms such as techniques for computing nearest neighbors. For example.y) :. a logic program (whose ordinary application is to compute bindings for variables) can also be used simply to decide whether or not a predicate has value True (T) or False (F). So far.
y) :. the program above would return T for input (A.y) :. (That is. not allow recursion. (A.x) which would have value T if both of the two cities were hub cities or if one were a satellite of the other. we implicitly append the background facts to the program and adopt the usual convention that a program has value T for a set of inputs if and only if the program interpreter returns T when actually running the program (with background facts appended) on those inputs. that is.B).90 CHAPTER 7. If a logic program. As positive examples. Depending on the exact set of examples. and so on. not allow functions. we say that the program covers the arguments and write covers(π. we might have (A1. we will say that a program is suﬃcient if it covers all of the positive instances and that it is necessary if it does not cover any of the negative instances. In ILP.Satellite(x. plus others.) From these training facts. as negative examples. the background information might be such ground facts as Hub(A). Hub(B). if presented with arguments not represented in the training set (but for which we have the needed background knowledge). otherwise it has value F . Satellite(A1. INDUCTIVE LOGIC PROGRAMMING In ILP. that has value T for all the positive instances and has value F for all the negative instances.1 Notation and Deﬁnitions In evaluating logic programs in ILP. we want to induce a program Nonstop(x. X). returns T for a set of arguments X. and some other pairs. 7. there are a variety of possible biases (called language biases). called “background knowledge. A1). we usually have additional information about the examples. we might induce the program: Nonstop(x. One might restrict the program to Horn clauses. we might have (A.y). and some other pairs. (Hub(A) is intended to mean that the city denoted by A is a hub city. for example. and Satellite(A1. π.y). it .Hub(x). A2).A) is intended to mean that the city denoted by A1 is a satellite of the city denoted by A. Following our terminology introduced in connection with version spaces.” In our airﬂight problem. As with other learning problems. Hub(y) :. we would like the function to guess well. written in terms of the background relations Hub and Satellite. we want the induced program to generalize well.Satellite(y. a program implements a suﬃcient condition that a training instance is positive if it covers all of the positive training instances. A1). suppose we are trying to induce a function Nonstop(x.A). that is to have value T for pairs of cities connected by a nonstop air ﬂight and F for all other pairs of cities. We are given a training set consisting of positive and negative examples. As an example of an ILP problem. Using the given background facts.
if it is necessary but not suﬃcient. is the one that has value F for all inputs. The most special logic program.]. and Consistent Programs As in version spaces. The most general logic program.1. We will be discussing a method that starts with [ρ :.] and specialize until the program is consistent. Suppose we are attempting to induce a logic program to compute the relation ρ. [ρ :]. namely [ρ :F ]. Necessary. 1 is a necessary program A positive instance covered by 2 and 3 + + + + + + + + + 2 is a sufficient program + 3 is a consistent program Figure 7.1: Suﬃcient.2.F ] and generalize until the program is consistent. we might relax this criterion and settle for a program that covers all but some fraction of the positive instances while allowing it to cover some fraction of the negative instances. it can be made to cover more examples by generalizing it. in which case we will call it consistent. we must next discuss those operations. is the one that has value T for all inputs. then reachieves suﬃciency in stages by generalizing—ensuring within each stage that the program remains necessary (by specializing). 7.) In the noiseless case.7. A GENERIC ILP ALGORITHM 91 implements a necessary condition if it covers none of the negative instances. which is certainly suﬃcient. if a program is suﬃcient but not necessary it can be made to cover fewer examples by specializing it. There are . Conversely. We illustrate these deﬁnitions schematically in Fig. 7. we want to induce a program that is both suﬃcient and necessary. namely a single clause with an empty body. With imperfect (noisy) training sets. or 2) start with [ρ :. which is called a fact in Prolog.2 A Generic ILP Algorithm Since the primary operators in our search for a consistent program are specialization and generalization. Two of the many diﬀerent ways to search for a consistent logic program are: 1) start with [ρ :. which is certainly necessary. specializes until the program is necessary (but might no longer be suﬃcient).
Remove a clause from the program We will be presenting an ILP learning method that adds clauses to a program when generalizing and that adds literals to the body of a clause when specializing. A literal that equates a variable in the head of the clause with another such variable or with a term mentioned in the background knowledge. (This possibility is equivalent to forming a specialization by making a substitution. is that a clause c1 is more special than a clause c2 if the set of literals in the body of c2 is a subset of those in c1 . d. c. A reﬁnement graph then tells us the ways in which we can specialize a clause by adding a literal to it. (Readers familiar with substitutions in the predicate calculus will note that this process is the inverse of substitution. b. there are three ways in which a logic program might be specialized: a. Remove literals from the body of a clause. Replace some variables in a program clause by terms (a substitution). INDUCTIVE LOGIC PROGRAMMING three major ways in which a logic program might be generalized: a. which is what we use here.) . In general. Clauses can be partially ordered by the specialization relation.92 CHAPTER 7. Literals that introduce a new distinct variable diﬀerent from those in the head of the clause. clause c1 is more special than clause c2 if c2 = c1 . Literals whose arguments are a subset of those in the head of the clause. When we add a clause. that is similar to a version space. Add a clause to the program Analogously. Clause c1 is an immediate successor of clause c2 in this graph if and only if clause c1 can be obtained from clause c2 by adding a literal to the body of c2 . This ordering relation can be used in a structure of partially ordered clauses. A special case. Practical ILP systems restrict the literals in various ways. Thus. Of course there are unlimited possible literals we might add to the body of a clause. Add literals to the body of a clause. called the reﬁnement graph. we will always add the clause [ρ :. c. Literals used in the background knowledge. we need only describe the process for adding literals. c. Typical allowed additions are: a.] and then specialize it by adding literals to the body. b.) b. Replace some terms in a program clause by variables.
Ξ of argument sets some known to be in the relation ρ and some not in ρ. A GENERIC ILP ALGORITHM 93 e. ILP programs that follow the approach we are discussing (of specializing clauses by adding a literal) thus have well deﬁned methods of computing the possible literals to add to a clause. It has an inner loop for constructing a clause. π for inducing a relation ρ.y) Satellite(y.z) Satellite(z. and the cur negative ones by Ξ− .) c z . we could also add the literals Nonstop(x. The algorithm has an outer loop in which it successively adds clauses to make π more and more suﬃcient.]. that is more and more necessary and in which it refers only to a subset. p. c. Thus the literals that we might consider adding are: Hub(x) Hub(y) Hub(z) Satellite(x. Now we are ready to write down a simple generic algorithm for inducing a logic program. We are given a training set.2.) These possibilities are among those illustrated in the reﬁnement graph shown in Fig.2.y) (x = y) (If recursive programs are allowed.) The algorithm is also given background relations and cur the means for adding literals to a clause. Ξcur . they are all syntactic ones from which the successors in the reﬁnement graph are easily computed. The literals used in the background knowledge are Hub and Satellite. (The positive instances in Ξcur will be denoted by Ξ+ . We start with the program [Nonstop(x. The algorithm can be written as follows: Generic ILP Algorithm (Adapted from [Lavraˇ & Dˇeroski. 1994. which are disallowed in some systems. Whatever restrictions on additional literals are imposed. (This possibility admits recursive programs. and Ξ− are the negative instances. It uses a logic program interpreter to compute whether or not the program it is inducing covers training instances.7.x) Satellite(x. A literal that is the same (except for its arguments) as that in the head of the clause.) We can illustrate these possibilities using our airﬂight example. 7.z) and Nonstop(z. Ξ+ are the positive instances.y). of the training instances.y) :. 60].
repeat [The inner loop makes c necessary.y) :. .94 CHAPTER 7.] Assign π := π.] Assign Ξcur := Ξcur − (the positive instances in Ξcur covered by π).y) :Hub(x) . . Cities A. [That is. B. INDUCTIVE LOGIC PROGRAMMING Nonstop(x. .] Assign c := c. shown in Fig. Figure 7. . Consider the portion of an airline route map.) 7. Nonstop(x.3 An Example We illustrate how the algorithm works by returning to our example of airline ﬂights.] Initialize c := ρ : − . [We add the clause c to the program. Nonstop(x. . . 7. The other . Nonstop(x. . Nonstop(x. .2: Part of a Reﬁnement Graph Initialize Ξcur := Ξ.Hub(x). . Initialize π := empty set of clauses. .y) :(x = y) . Hub(y) . . .y) :. and we know that there are nonstop ﬂights between all hub cities (even those not shown on this portion of the route map). until c covers no negative instances in Ξcur . [This is a nondeterministic choice point.y) . until π is suﬃcient. . .] Select a literal l to add to c. until c is necessary. c. and C are “hub” cities. . (The termination tests for the inner and outer loops can be relaxed as appropriate for the case of noisy instances.3. . repeat [The outer loop works to make π suﬃcient.y) :Satellite(x. l.
< C. A >. C >. A >. B >. < B2. < A2. < C. B1 >. B1 B2 B A1 C1 A A2 C C2 Figure 7. < A2. C2 >. B >. B2 >. < C. The learning program is given a set of positive instances. < C. < C2. < A1. 7. < B. < B. we will assume that Ξ− contains all those pairs of cities shown in Fig.3.3 that are not in Ξ+ (a type of closedworld assumption). C2 >. A2 >. so the training set does not necessarily exhaust all the cities. C >. < B2. < A. < A.7. < B. A1 >. < C1. C1 >. C >. and we know that there are nonstop ﬂights between each satellite city and its hub. < B. The training set. A >.3: Part of an Airline Route Map We want the learning program to induce a program for computing the value of the relation Nonstop. < C2. < B. B1 >. < B1. A1 >. B2 >. < B. < B1. AN EXAMPLE 95 cities are “satellites” of one of the hubs. C >. B >. Ξ. C1 >. B >. Ξ− . of pairs of cities between which there are not nonstop ﬂights. < A. C >. can be thought of as a partial . < A. A >. Ξ+ contains just the pairs: {< A. Ξ+ . < C1. < C. B >. A >. < A. A >. B >. A1 >. < C. C >. < A2. < A1. B1 >. A >. < C. A >. B >. of pairs of cities between which there are nonstop ﬂights and a set of negative instances. < C2. C >} There may be other cities not shown on this map. These are: {< A. C1 >. < C. C2 >. A2 >. < C1. < A. < A1. < B. < B1. C >} For our example. B2 >. < B2. B >. < B. A2 >.
B >. The following positive instances in Ξ are covered by Nonstop(x. C >. C >. < C. A1 >. < C. < A2. intensional. We will use the notation Hub(x) to express that the city named x is in the relation Hub. < B1. C1 >. This clause is not necessary because it covers all the negative examples (since it covers all examples). Doing so will give us a more compact. So we must add a literal to its (empty) body. A >. We desire to learn the Nonstop relation as a logic program in terms of the background relations.96 CHAPTER 7. C >.y) to express that the pair < x. A. Knowing that the predicate Nonstop is a twoplace predicate. < B2. < C.Hub(x): {< A. we interpret the logic program Nonstop(x. the inner loop of our algorithm initializes the ﬁrst clause to Nonstop(x.y) :Hub(x) for all pairs of cities in Ξ. B >. < C2. < C >} All other cities mentioned in the map are assumed not in the relation Hub. using the pairs given in the background relation Hub as ground facts.y) :. < B. C >} All other pairs of cities mentioned in the map are not in the relation Satellite. >. < A. Hub and Satellite. We will use the notation Satellite(x. < C1.y) :. < A. A >. which are also given in extensional form. B >. description of the relation. INDUCTIVE LOGIC PROGRAMMING description of this relation in extensional form—it explicitly names some pairs in the relation and some pairs not in the relation. We assume the learning program has the following extensional deﬁnitions of the relations Hub and Satellite: Hub {< A >. B >. The following negative instances are also covered: . y > is in the relation Satellite. < B >. Satellite {< A1. < B. Suppose (selecting a literal from the reﬁnement graph) the algorithm adds Hub(x). < C. A2 >. < B. B2 >. C2 >} To compute this covering.. < A. A >. and this description could well generalize usefully to other cities not mentioned in the map. < B. B1 >.
< A. B2 >. A1 >. < A. C2 >. The program now contains two clauses: Nonstop(x.Satellite(x. C1 >. B2 >. and we can terminate the ﬁrst pass through the inner loop. Suppose we next add Hub(y). < A. C >. < A. C >.Hub(x). A >. A >. < A2. Hub(y) is necessary. π. < C. < C.y) covers no negative instances. Hub(y) :. < C1. < B.Hub(x). Hub(y) are removed from Ξ to form the Ξcur to be used in the next pass through the inner loop. B1 >.3. C2 >} . These positive instances are not covered by the clause: {< A. B2 >. < B. C1 >. A1 >. B2 >.y) :. B1 >. C2 >} 97 Thus.y) :. < C. < C. < B.Hub(x). C >. < B. C >} These instances are removed from Ξcur for the next pass through the inner loop. A >. B1 >. A1 >. Ξcur consists of all the negative instances in Ξ plus the positive instances (listed above) that are not yet covered.y) This program is not yet suﬃcient since it does not cover the following positive instances: {< A. A >. C2 >. In order to attempt to cover them. A >. < B1.Hub(x). < B. A >. B >. < C. The clause Nonstop(x.y) :. the inner loop creates another clause c. < C. < B1. < C2.Satellite(x. C1 >. B1 >. < C.7. B >. B >} There are no longer any negative instances in Ξ covered so the clause Nonstop(x. The following positive instances are covered by Nonstop(x. < C. C >. A1 >. C >} The positive instances that were covered by Nonstop(x. and so we must add literals to make it necessary. Suppose we add the literal Satellite(x. Hub(y): {< A. A2 >. < C. < B. < C. B >. < A2. < A. AN EXAMPLE {< A. But the program.y) :. < A. so it is necessary. < C1. B >. consisting of just this clause is not suﬃcient. B >. the clause is not yet necessary and another literal must be added. < B2. C1 >. A2 >. < C2. A2 >..y) :. < A1. < B.y) :. This clause covers all the negative instances. A2 >. < B. < B2.y). < B. It does cover the following positive instances in Ξcur : {< A1. initially set to Nonstop(x. < B.
Satellite(x. < C2. C1 >. C1 >. B1 >. C >. B >. Again.Satellite(y. B1 >. B2 >.4 Inducing Recursive Programs To induce a recursive program. C1 >. < B2. < B1. < B1. < C1.y) :. we allow the addition of a literal having the same predicate letter as that in the head of the clause. C2 >.y) :. B1 >. We have introduced two new cities. With that extension. B2 >. B >.4. < B2.x). < B. Note that this program can be applied (perhaps with good generalization) to other cities besides those in our partial map—so long as we can evaluate the relations Hub and Satellite for these other cities. < C2. This clause is necessary. B2 >. the procedure terminates with: Nonstop(x. < C. Various mechanisms must be used to ensure that such a program will terminate. The relation Canfly is satisﬁed by the following pairs of postive instances: {< B1. B and C are hub cities.x) Since each clause is necessary. B >. < B1. C >. < C. < C. C >. We now seek to learn a program for Canfly(x. < C1.y) :. B >.Satellite(y. C1 >. < B1. < C2. < C1. No ﬂights exist between these cities and any other cities—perhaps there are only bus routes as shown by the grey lines in the map. 7. < C1. Our example continues the one using the airline map. Consider the map shown in Fig. The process is best illustrated with another example. C2 >. one such is to make sure that the new literal has diﬀerent variables than those in the head literal. < B. and since the program containing all three clauses is now suﬃcient. B2 >. we add the clause Nonstop(x. C2 >. the method can be used to induce more general logic programs. B1 >. < C2. B1 and B2 are satellites of B.Hub(x). but we make the map somewhat simpler in order to reduce the size of the extensional relations used. C2 >. we show how the technique can be extended to use recursion on the relation we are inducing. < B. B >. < B2. and the whole program is suﬃcient. C2 >. C1 >} . In the next section. < C. < B. 7. < C. C >. the program is also consistent with all instances of the training set. < C2. B1 >.98 CHAPTER 7. B3 and C3. B2 >. C >. < B2. INDUCTIVE LOGIC PROGRAMMING During the next pass through the inner loop. < B2. C1 and C2 are satellites of C. < C1. Hub(y) :. < B.y) that covers only those pairs of cities that can be reached by one or more nonstop ﬂights.
C3 >. C >. < B. B1 >. C2 >. < B. B1 >. < C2. Suppose that the inner loop adds the background literal Nonstop(x. As before. B >. < B2. C3 >. INDUCING RECURSIVE PROGRAMS 99 B3 B1 B2 B C1 C C3 C2 Figure 7. < C3. B3 >. C3 >. < C3. C1 >. < B2. C1 >. < B3. < B1. C2 >. C1 >. < C. C3 >. B2 >.y) :. < B2. C1 >. C >. B >. The clause Canfly(x.y) is necessary. < C3. C >. B1 >.Nonstop(x. < B1. it covers no negative instances.4: Another Airline Route Map Using a closedworld assumption on our map. < B1. < B2.y) using the extensionally deﬁned background relation Nonstop given earlier (modiﬁed as required for our reduced airline map) and Canfly itself (recursively). B3 >. B1 >. . C2 >. C1 >. B2 >. C2 >. < C1. < C. < C2.4. B3 >. B3 >. < B3.7. < C2. B3 >. C3 >. < C1. B2 >. < C3. < C. < C3. B2 >. B1 >. B3 >. < B. C >. B2 >. < B2.y). < C2. But it is not suﬃcient because it does not cover the following positive instances: {< B1. < B3. we take the negative instances of Canfly to be: {< B3. C3 >} We will induce Canfly(x. < B1. < B3. < C. B3 >. < C1. < C1. < B1. < B2. < B3. C3 >. we start with the empty program and proceed to the inner loop to construct a clause that is necessary. < C3. B2 >. B1 >. < C3. < B3.
C2 >.Nonstop(x.z) for some z. < C1. we must add another clause to the program. but it also covers many negative instances such as < B2. It is convenient to express this probability in “odds form. p =(number of positive instances covered by the clause)/(total number of instances covered by the clause). < C2.B).z). The interpreter attempts to establish Nonstop(B1.Nonstop(x. We digress brieﬂy to describe how a program containing a clause with unbound variables in its body is interpreted. This clause is necessary. B3 >. In FOIL. no negative instances are covered. and < B. The program is now suﬃcient and consistent. C2 >. That is. The interpreter attempts to establish Nonstop(B3.Nonstop(x. Expressing the probability in terms of the odds.y) :. B).Nonstop(x. so the clause does not cover < B3. In the inner loop. .y) :. it is: Canfly(x. is a background fact. o. we ﬁrst create the clause Canfly(x. B >.y) to yield the clause Canfly(x. < C1. There are no background facts that match. A major problem involves deciding how to select a literal to add in the inner loop (from among the literals that are allowed). the interpreter returns T —which means that the instance < B1. Canfly(z. B >. B3 >. C1 >} Thus. Suppose we try to interpret it for the positive instance Canfly(B1. INDUCTIVE LOGIC PROGRAMMING < B. < C2.y) 7. Canfly(z.B2). A measure that gives the same comparison as does Quinlan’s is based on the amount by which adding a literal increases the odds that an instance drawn at random from those covered by the new clause is a positive instance beyond what these odds were before adding the literal. suppose it adds Canfly(z. Suppose now.5 Choosing Literals to Add One of the ﬁrst practical ILP systems was Quinlan’s FOIL [Quinlan. we attempt to interpret the clause for the negative instance Canfly(B3.y) :Nonstop(x. we obtain p = o/(1 + o).y) :. Let p be an estimate of the probability that an instance drawn at random from those covered by a clause before adding the literal is a positive instance.z) covers all of the positive instances not already covered by the ﬁrst clause. Quinlan suggested that candidate literals can be compared using an informationlike measure—similar to the measures used in inducing decision trees.z) for some z.y) :. for example. we see that the clause Canfly(x.z) which introduces the new variable z.y). Using the interpreter. 1990]. B2 > is covered. that a covered instance is positive is deﬁned to be o = p/(1 − p). Since Nonstop(B1. So the inner loop must add another literal.” The odds. B >.100 CHAPTER 7. This time.z).
E. For example. E. 1992. Muggleton. postprocessing. Lavraˇ & Dˇeroski. some of the instances previously covered are still covered. Muggleton. The odds will be denoted by ol . then one possible binary split . that gives maximal increase in these odds. we want a literal that gives a high value of λl . F }. E. 1992]. if a node tested the variables xi and xj .) Besides ﬁnding a literal with a high value of λl . It is also possible to make a multivariate split—testing the values of two or more variables at a time. B. We want to select a literal. B. b) place further restrictions on the variables if the literal selected has the same predicate letter as the literal being induced (in order to prevent inﬁnite recursion). The reader should also refer to [Pazzani & Kibler. With categorical variables. C. That is. the split at each node involves asking to which of several mutually exclusive and exhaustive subsets the value of a variable belongs. Let pl denote the probability that an instance drawn at random from the instances covered by the new clause (with l added) is positive. We refer the reader to Quinlan’s paper for further discussion of these points. and if xi could have values drawn from {A. RELATIONSHIPS BETWEEN ILP AND DECISION TREE INDUCTION101 After selecting a literal. Quinlan also discusses postprocessing pruning methods and presents experimental results of the method applied to learning recursive relations on lists. and if xi and xj both could have values drawn from {A. When splitting on a single variable. if we deﬁne λl = ol /o. D. 1994. an nvariable split would be based on which of several nary relations the values of the variables satisﬁed. some of these are positive and some are negative. C} or one of {D. l. B. F }. and for some other standard tasks mentioned in the machine learning literature.6 Relationships Between ILP and Decision Tree Induction The generic ILP algorithm can also be understood as a type of decision tree induction. bottomup methods. For example. C. Recall the problem of inducing decision trees when the values of attributes are categorical. so we could just as well use the latter instead. Quinlan’s FOIL system also restricts the choice to literals that: a) contain at least one variable that has already been used. F }. if a node tested the variable xi . and c) survive a pruning test based on the values of λl for those literals selected so far. and LINUS. 7. (It turns out that the value of Quinlan’s information theoretic measure increases monotonically with λl . Specializing the clause in such a way that it fails to cover many of the negative instances previously covered but still covers most of the positive instances previously covered will result in a high value of λl . l. to add to a clause. on learning rules for chess endgames and for the card game Eleusis. c z Discuss preprocessing. D. 1991.7. then one possible split (among many) might be according to whether the value of xi had as value one of {A.6.
so that Node 2 is satisﬁed only by the positively labeled samples in Ξ2 . . Thus we will speak of a toplevel decision tree and various subdecision trees. . these satisfy Node 1 at the top level. the patterns in Ξ are ﬁrst ﬁltered through the decision tree in toplevel node 1. that distinguishes positive from negative patterns in Ξ is then given in terms of the following logic program: R :. The intensional deﬁnition will be in terms of a logic program in which the relation R is the head of a set of clauses whose bodies involve the background relations. the ILP problem is as follows: We are given a training set. That is.) In this framework. and the rest are ﬁltered to the left (more on what happens to these later).) Let us call the subset of patterns satisfying these relations. . In our example. . on various subsets of these variables. Rk . . The background relation R1 is satisﬁed by some of these patterns. R1 . They correspond to the clause created in the ﬁrst pass through the inner loop of the generic ILP algorithm. In this diagram. . y. The generic ILP algorithm can be understood as decision tree induction. we are given sets of tuples that are in these relations. Ξ4 contains only negatively labeled patterns and the union of Ξ1 and Ξ3 contains all the positively labeled patterns. C >. . < C. (That is.) In broad outline. the subset of patterns satisfying all the relations. R1 . (Actually. and R3 contains only positive instances from Ξ. All other patterns. Ri . Ξ. xj > satisﬁed the relation {< A.102 CHAPTER 7. R2. such that all of the positively labeled patterns in Ξ are satisﬁed by R and none of the negatively labeled patterns are. that is {Ξ − Ξ1 } = Ξ2 are ﬁltered to the left by Node 1. INDUCTIVE LOGIC PROGRAMMING (among many) might be according to whether or not < xi . where each node of the decision tree is itself a subdecision tree. the method for inducing an intensional version of the relation R is illustrated by considering the decision tree shown in Fig. D >}. Rightgoing patterns are ﬁltered through a sequence of relational tests until only positively labeled patterns satisfy the last relation—in this case R3 . . z. these are ﬁltered to the right (to relation R2 ). . Ξ2 is then ﬁltered by toplevel Node 2 in much the same manner. . Ξ1 . (Note that our subset method of forming singlevariable splits could equivalently have been framed using 1ary relations—which are usually called properties. Rk . our decision trees will be decision lists—a special case of decision trees. R. (We might say that this combination of tests is necessary. of positively and negatively labeled patterns whose components are drawn from a set of variables {x.R1. The relation. R3 .}. We are also given background relations. 7. The positively labeled patterns in Ξ form an extensional deﬁnition of a relation. R2 . .5. R. but we will refer to them as trees in our discussions.) We desire to construct an intensional deﬁnition of R in terms of the R1 . and each subdecision tree consists of nodes that make binary splits on several variables using the background relations. We continue ﬁltering through toplevel nodes until only the negatively labeled patterns fail to satisfy a top node.
7. C >. < A. < B. 7. < C. A >. B >. 7. B. A1. 7. the relation. A >.3 (using the closed world assumption) are not in the relation Nonstop and thus are negative instances. < C. RELATIONSHIPS BETWEEN ILP AND DECISION TREE INDUCTION103 Node 1 R1 F F T R2 T R3 F T (only positive instances satisfy all three tests) T 1 Node 2 R4 F R5 F F 4= 2 (only negative 3 instances) F 2= T T (only positivel instances satisfy these two tests) T 3 1 Figure 7.5: A Decision Tree for ILP :. C >. C1. B >. A2. C1 >. < A. decisiontree induction would be a very diﬃcult task—involving as it does the need to invent relations on . C. C2 >. B >.6. < C2. < C1. A >.R4. < B. C >. < B2. The logic program resulting from this decision tree is the same as that produced by the generic ILP algorithm. C2} except (for simplicity) we do not allow patterns in which x and y have the same value. As before. A1 >. A2 >. B >. R5 If we apply this sort of decisiontree induction procedure to the problem of generating a logic program for the relation Nonstop (refer to Fig. B1. < B. contains the following pairs of cities. In setting up the problem. A >. < A. The values of these components range over the cities {A. B2 >. < C. < A2. Ξ can be expressed as a set of 2dimensional vectors with components x and y. < C. C >} All other pairs of cities named in the map of Fig. Nonstop. < B1. < A1.3). the training set. < B. Because the values of x and y are categorical. B1 >. B2. which are the positive instances: {< A.6. we obtain the decision tree shown in Fig.
7 To be added. from among the available tests. We select these relations in the same way that we select literals. the problem is made much easier. Bibliographical and Historical Remarks . But with the background relations. we make a selection based on which leads to the largest value of λRi . Ri (in this case Hub and Satellite). 7.104 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING x and y to be used as tests.
<A.B>} (Only positive instances) Satellite(x.C>.C>. <C. <B1.B>.y) F T T F Node 3 (top level) {<A1.7.7. <A. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 105 Node 1 (top level) F Hub(x) T Hub(y) F T T Node 2 (top level) F {<A.A>.C1>. <A2. <B.C2>} (Only positive instances) {Only negative instances} Figure 7. <C.<B.B1>.B>. <C. <C.6: A Decision Tree for the Airline Route Problem .B2>.C>} (Only positive instances) Satellite(y.A>.B>. <C1.A>. <B. <B.A>.C>. <B2.A1>. <C2.A2>.x) F T T F {<A.
106 CHAPTER 7. INDUCTIVE LOGIC PROGRAMMING .
Xi . P . Other overviews? 8. is P (X). 1988. m. from among a restricted set of hypotheses such that with high probability the function we select will be approximately correct (small probability of error) on subsequent samples drawn at random according to the same distribution from which the labeled samples were drawn. That is. 1990] give nice surveys of the important results. . . (In the literature of PAC learning theory. a given training set of sample patterns might be adequate to allow us to select a function. .1 Notation and Assumptions for PAC Learning Theory We assume a training set Ξ of ndimensional vectors. This insight led to the theory of probably approximately correct (PAC) learning—initially developed by Leslie Valiant [Valiant. i = 1. the target function is usually called the target concept and is denoted by c. 1984]. or later being presented to the learner. each labeled (by 1 or 0) according to a target function.) Our problem is to guess 107 . The probability distribution. [Dietterich. can be arbitrary. but to be consistent with our previous notation we will continue to denote it by f . Haussler. The probability of any given vector X being in Ξ. We present here a brief description of the theory for the case of Boolean functions. 1990. consistent with the labeled samples. . Haussler. which is unknown to the learner. f . We gave some intuitive arguments to support the claim that after seeing only a small fraction of the possible inputs (and their values) that we could guess almost correctly the values of most subsequent inputs—if we knew that the function we were trying to guess belonged to an appropriately restricted subset of functions.Chapter 8 Computational Learning Theory In chapter one we posed the problem of guessing a function given a set of sample inputs and their values.
in the dimension. is polynomially PAC learnable in terms of H provided there exists a polynomialtime learning algorithm (polynomial in the number of samples needed. h won’t be identical to f . To quantify this notion. such an h might turn out to be approximately correct (for a given value of ε). P (X) We say that h is approximately (except for ε ) correct if εh ≤ ε. C. we could always produce a consistent hypothesis from this set by testing all of them against the training data.108 CHAPTER 8. C. as the probability that an X drawn randomly according to P will be misclassiﬁed: εh = [X:h(X)=f (X)] Boldface symbols need to be smaller when they are subscripts in math environments. h is consistent with this randomly selected training set. This can only be done for certain classes of functions. and these are summarized in a table to be presented later. In general. and on others it might not. where ε is the accuracy parameter. H is called the hypothesis space. it outputs a hypothesis h H. εh ≤ ε. h. and in 1/δ) that PAClearns functions in C in terms of H. We assume that the target function is some element of a set of functions. will such an h be approximately correct (and for what value of ε)? On some training occasions. based on the labeled samples in Ξ. so we want an algorithm that PAClearns functions in polynomial time. in 1/ε. Such a hypothesis is called probably (except for δ) approximately (except for ε) correct. we want h to be approximately correct. such that with probability at least (1 − δ). But if there are an exponential number of hypotheses. is an element of a set. εh . n. If m is large enough. we deﬁne the error of h. We want learning algorithms that are tractable. C. that is. but it was later shown that some functions cannot be polynomially PAClearned under such an assumption (assuming . A class. that would take exponential time. using m randomly drawn training samples. h(X). COMPUTATIONAL LEARNING THEORY a function. which includes the set. of hypotheses. of target functions. H. We seek training methods that produce consistent hypotheses in less time. Suppose we are able to ﬁnd an h that classiﬁes all m randomly drawn training samples correctly. Initial work on PAC assumed H = C. The time complexities for various hypothesis sets have been determined. We say that h is probably (except for δ) approximately correct (PAC) if the probability that it is approximately correct is greater than 1 − δ. That is. where δ is the conﬁdence parameter. but we can strive to have the value of h(X) = the value of f (X) for most X’s. In general. we say that a learning algorithm PAClearns functions from C in terms of H iﬀ for every function f C. We also assume that the hypothesis. such an h is guaranteed to be probably approximately correct. In PAC theory such a guessed function is called the hypothesis. m. Ξ. We shall show that if m is greater than some bound whose value depends on ε and δ. If there are a ﬁnite number of hypotheses in a hypothesis set (as there are for many of the hypothesis sets we have considered).
We will call the hypotheses in this subset bad. Then.8. Since C and H do not have to be identical. prob[hb classiﬁes all m patterns correctly hb HB ] ≤ (1 − ε)m . . h2 . the probability that there exists a hypothesis h consistent with f for the members of Ξ but with error greater than ε is at most He−εm .2. . f be any classiﬁcation function in H. HB ] . . from which patterns are drawn nor does it say anything about the properties of the learning algorithm. HB .2.2 8. we have the further restrictive deﬁnition: A properly PAClearnable class is a class C for which there exists an algorithm that polynomially PAClearns functions from C in terms of C. in H. where K = HB . . The probability that hi will classify a pattern correctly is (1−εhi ). P . Since εhb > ε. A subset. would classify a pattern correctly is (1 − εhb ). . say hb . That is. The probability that the error of this randomly selected h is greater than some ε. where H is the number of hypotheses in H. The probability that it would classify all m independently drawn patterns correctly is then less than (1 − ε)m . hS }.1 (Blumer. 1987]: Theorem 8. the probability that hb (or any other bad hypothesis) would classify a pattern correctly is less than (1 − ε). hi . Proof: Consider the set of all hypotheses. et al.) Let H be any set of hypotheses. . et al.1 PAC Learning The Fundamental Theorem Suppose our learning algorithm selects some h randomly from among those that are consistent with the values of f on the m training patterns. PAC LEARNING 109 P = NP)—but can be polynomially PAClearned if H is a strict superset of C! Also our deﬁnition does not specify the distribution. and ε > 0. diﬀerently than f would classify it). where S = H. prob[some h HB classiﬁes all m patterns correctly] = hb HB prob[hb classiﬁes all m patterns correctly hb ≤ K(1 − ε)m . We state this result as a theorem [Blumer. The error for hi is εhi = the probability that hi will classify a pattern in error (that is. .. . with h consistent with the values of f (X) for m instances of X (drawn according to arbitrary P ). Ξ be a set of m ≥ 1 training examples drawn independently according to some distribution P . 8. The probability that any particular one of these bad hypotheses. is less than or equal to He−εm . of H will have error greater than ε. {h1 .
No class larger than that can be guaranteed to be properly PAC learnable. First it clearly states that we can select any hypothesis consistent with the m samples and be assured that with probability (1 − δ) its error will be less than ε. the probability that there exists a hypothesis in H that is consistent with f on these samples and has error greater than ε is at most δ. Taking the natural logarithm of both sides yields: ln H − εm ≤ ln δ or m ≥ (1/ε)(ln H + ln(1/δ)) QED This corollary is important for two reasons. H can be no larger k than 2O(n ) . Proof: We are to ﬁnd a bound on m that guarantees that prob[there is a hypothesis with error > ε and that classiﬁes all m patterns correctly] ≤ δ. Values of m greater than that bound are suﬃcient (but might not be necessary).110 That is. it shows that in order for m to increase no more than polynomially with n. using the result of the theorem. Since K ≤ H and (1 − ε)m ≤ e−εm . Also. QED A corollary of this theorem is: Corollary 8.2 Given m ≥ (1/ε)(ln H + ln(1/δ)) independent samples. CHAPTER 8. Thus. Here is a possible point of confusion: The bound given in the corollary is an upper bound on the value of m needed to guarantee polynomial probably approximately correct learning. we must show that He−εm ≤ δ. . COMPUTATIONAL LEARNING THEORY prob[there is a bad hypothesis that classiﬁes all m patterns correctly] ≤ K(1 − ε)m . we have: prob[there is a bad hypothesis that classiﬁes all m patterns correctly] = prob[there is a hypothesis with error > ε and that classiﬁes all m patterns correctly] ≤ He−εm . We will present a lower (necessary) bound later in the chapter.
we have h = x2 x3 x4 . 1. .01. For n = 50. we exit with the null concept (h ≡ 0 for all patterns). we check all of the patterns labeled with a 0 to make sure that none of them is assigned value 1 by h. 0) (1. we have h = x1 x2 x3 x4 . Linearly Separable Functions Let H be the set of all linearly separable functions. then there exists no term that consistently classiﬁes the patterns in Ξ. If. and m ≥ (1/ε)(ln(3n ) + ln(1/δ)) ≥ (1/ε)(1.8. at any stage of the algorithm. Initialize a Boolean function. In order to show that terms are properly PAC learnable. we additionally have to show that one can ﬁnd in time polynomial in m and n a hypothesis h consistent with a set of m patterns labeled by the value of a term. after processing the second pattern.2. all labeled with a 1 (from [Dietterich. The following procedure for ﬁnding such a consistent hypothesis requires O(nm) steps (adapted from [Dietterich. m ≥ 5. Otherwise. Then. After processing all the patterns labeled with a 1. Xi . Ξ. PAC LEARNING 111 8. Find the ﬁrst pattern. ﬁnally. H = 3n . in that list that is labeled with a 1. 1. ε = 0.) If there are no patterns labeled by a 1. page 268]): We are given a training sequence. (Components with value 1 will have corresponding positive literals. we exit with h. 1990]): (0. and we exit with failure. Then. that is labeled with a 1. Then.2. As an example. for each additional pattern. 0) (1. 1.1n + ln(1/δ)) Note that the bound on m increases only polynomially with n. any patterns labeled with a 0 are assigned a 1 by h. H ≤ 2n . h. 1/ε. components with value 0 will have corresponding negative literals. 0. after the third pattern. 0) After processing the ﬁrst pattern. we have h = x2 x4 . to the conjunction of the n literals corresponding to the values of the n components of X1 . 961 guarantees PAC learnability. of m examples. 1. and 2 Change this paragraph if this algorithm was presented in Chapter Three. say X1 . 1990. we delete from h any Boolean variables appearing in Xi with a sign diﬀerent from their sign in h. 1.2 Terms Examples Let H be the set of terms (conjunctions of literals). consider the following patterns. and 1/δ.01 and δ = 0.
Linear programming is polynomial.112 CHAPTER 8.2. note that the bound on m increases only polynomially with n. . ε = 0. COMPUTATIONAL LEARNING THEORY m ≥ (1/ε) n2 ln 2 + ln(1/δ) Again. 748 guarantees PAC learnability. 1/ε. and 1/δ. m ≥ 173. with (0. 8. m.) H terms kterm DNF (k disjunctive terms) kDNF (a disjunction of ksized terms) kCNF (a conjunction of ksized clauses) kDL (decision lists with ksized terms) lin. Show that the sample size.1) weights k2NN DNF (all Boolean functions) H 3n 2O(kn) 2O(n 2O(n 2O(n k k Time Complexity polynomial NPhard polynomial polynomial polynomial polynomial NPhard NPhard polynomial P. needed to ensure PAC learnability is polynomial (or better) in (1/ε). b. sep. we would have additionally to show that one can ﬁnd in time polynomial in m and n a hypothesis h consistent with a set of m labeled linearly separable patterns. To show that linearly separable functions are properly PAC learnable. feedforward neural networks with exactly k hidden units and one output unit. and n by showing that ln H is polynomial or better in the number of dimensions.01 and δ = 0. (Adapted from [Dietterich.3 Some Properly PACLearnable Classes Some properly PAClearnable classes of functions are given in the following table. 1990. (1/δ). pages 262 and 268] which also gives references to proofs of some of the time complexities.01. lin. Learnable? yes no yes yes yes yes no no no ) k ) k lg n) 2O(n ? ? n 22 2 ) (Members of the class k2NN are twolayer. Show that there is an algorithm that produces a consistent hypothesis on m ndimensional samples in time polynomial in m and n. For n = 50.) Summary: In order to show that a class of functions is Properly PACLearnable : a. sep.
page 41617] says: “ . We should examine the model to see what constraints can be relaxed and made more realistic. as shown in Fig. THE VAPNIKCHERVONENKIS DIMENSION 113 As hinted earlier. the n components of these patterns are real numbers.1 The VapnikChervonenkis Dimension Linear Dichotomies Consider a set. And yet. H. Therefore. Rn . of functions. and a set. then H shatters Ξ. H.8. of m patterns in all 2m ways. we say that if a subset. there is little likelihood that such a hypothesis will generalize well beyond the training set. consider a set Ξ of m patterns in the ndimensional space. linearly separable functions implemented by TLUs whose weight values are restricted to 0 and 1 are not properly PAC learnable. ktermDNF is a subclass of kCNF! So. is its ability to make arbitrary classiﬁcations of the patterns in Ξ. One measure of the expressive power of a set of hypotheses. even in the nonBoolean case). there are n 22 ways to dichotomize them. Similarly. of functions can dichotomize a set. But a subset. For example.3 8. feedforward neural networks is not polynomially PAC learnable is more an attack on the theory than it is on the networks. Ξ. Ξ. one would be able to ﬁnd a hypothesis in kCNF that is probably approximately correct for the target function. Ξ. there are 2m diﬀerent ways to divide these patterns into two disjoint and exhaustive subsets. even if the target function were in ktermDNF. if a hypothesis drawn from a set that could make arbitrary classiﬁcations of a set of training patterns.) Although PAC learning theory is a powerful analytic tool.” 8. but ktermDNF is not. How many linear dichotomies of m patterns in n dimensions are there? For example. As an example. 8. of (unlabeled) patterns. (At the time of writing. If Ξ were to include all of the 2n Boolean patterns. of the Boolean functions might not be able to dichotomize an arbitrary set.3. . a proof within some model of learning that learning is not feasible is an indictment of the model. The fact that the class of twolayer. of course. relative to Ξ. In general (that is. for example.3. it (like complexity theory) deals mainly with worstcase results. and (of course) the set of all possible Boolean functions dichotomizes them in all of these ways.1 If there are m patterns in Ξ. of m Boolean patterns in all 2m ways. An interesting question is whether or not the class of functions in k2NN is polynomially PAC learnable if the hypotheses are drawn from k 2NN with k > k. whereas unrestricted linearly separable functions are. H. . . sometimes enlarging the class of hypotheses makes learning easier. which have had many successful applications. this matter is still undecided.1. We say there are 2m diﬀerent dichotomies of Ξ. As [Baum. there are 14 dichotomies 1 And. (That is. 1994. humans are capable of learning in the natural world.) We deﬁne a linear dichotomy as one implemented by an (n − 1)dimensional hyperplane in the ndimensional space. It is possible that enlarging the space of hypotheses makes ﬁnding one that is consistent with the training examples easier. the table above shows that kCNF is PAC learnable.
114
CHAPTER 8. COMPUTATIONAL LEARNING THEORY
of four points in two dimensions (each separating line yields two dichotomies depending on whether the points on one side of the line are classiﬁed as 1 or 0). (Note that even though there are an inﬁnite number of hyperplanes, there are, nevertheless, only a ﬁnite number of ways in which hyperplanes can dichotomize a ﬁnite number of patterns. Small movements of a hyperplane typically do not change the classiﬁcations of any patterns.)
7
6 5
2 3
1
4
14 dichotomies of 4 points in 2 dimensions
Figure 8.1: Dichotomizing Points in Two Dimensions The number of dichotomies achievable by hyperplanes depends on how the patterns are disposed. For the maximum number of linear dichotomies, the points must be in what is called general position. For m > n, we say that a set of m points is in general position in an ndimensional space if and only if no subset of (n+1) points lies on an (n−1)dimensional hyperplane. When m ≤ n, a set of m points is in general position if no (m − 2)dimensional hyperplane contains the set. Thus, for example, a set of m ≥ 4 points is in general position in a threedimensional space if no four of them lie on a (twodimensional) plane. We will denote the number of linear dichotomies of m points in general position in an ndimensional space by the expression ΠL (m, n).
Include the derivation.
It is not too diﬃcult to verify that:
n
ΠL (m, n) = 2
i=0
C(m − 1, i)
for m > n, and
= 2m
for m ≤ n
8.3. THE VAPNIKCHERVONENKIS DIMENSION where C(m − 1, i) is the binomial coeﬃcient
(m−1)! (m−1−i)!i! .
115
The table below shows some values for ΠL (m, n). m (no. of patterns) 1 2 3 4 5 6 7 8 1 2 4 6 8 10 12 14 16 n (dimension) 2 3 4 2 2 2 4 4 4 8 8 8 14 16 16 22 30 32 32 52 62 44 84 114 58 128 198
5 2 4 8 16 32 64 126 240
Note that the class of linear dichotomies shatters the m patterns if m ≤ n + 1. The boldface entries in the table correspond to the highest values of m for which linear dichotomies shatter m patterns in n dimensions.
8.3.2
Capacity
(m,n) Let Pm,n = ΠL2m = the probability that a randomly selected dichotomy (out m of the 2 possible dichotomies of m patterns in n dimensions) will be linearly separable. In Fig. 8.2 we plot Pλ(n+1),n versus λ and n, where λ = m/(n + 1). Note that for large n (say n > 30) how quickly Pm,n falls from 1 to 0 as m goes above 2(n + 1). For m < 2(n + 1), any dichotomy of the m points is almost certainly linearly separable. But for m > 2(n + 1), a randomly selected dichotomy of the m points is almost certainly not linearly separable. For this reason m = 2(n + 1) is called the capacity of a TLU [Cover, 1965]. Unless the number of training patterns exceeds the capacity, the fact that a TLU separates those training patterns according to their labels means nothing in terms of how well that TLU will generalize to new patterns. There is nothing special about a separation found for m < 2(n + 1) patterns—almost any dichotomy of those patterns would have been linearly separable. To make sure that the separation found is forced by the training set and thus generalizes well, it has to be the case that there are very few linearly separable functions that would separate the m training patterns. Analogous results about the generalizing abilities of neural networks have been developed by [Baum & Haussler, 1989] and given intuitive and experimental justiﬁcation in [Baum, 1994, page 438]:
“The results seemed to indicate the following heuristic rule holds. If M examples [can be correctly classiﬁed by] a net with W weights (for M >> W ), the net will make a fraction ε of errors on new examples chosen from the same [uniform] distribution where ε = W/M .”
116
CHAPTER 8. COMPUTATIONAL LEARNING THEORY
P (n + 1), n n
1 0.75 75 0.5 .5 0.25 25 0 0 1 2 3 4 10
50 40 30 20
Figure 8.2: Probability that a Random Dichotomy is Linearly Separable
8.3.3
A More General Capacity Result
Corollary 7.2 gave us an expression for the number of training patterns suﬃcient to guarantee a required level of generalization—assuming that the function we were guessing was a function belonging to a class of known and ﬁnite cardinality. The capacity result just presented applies to linearly separable functions for nonbinary patterns. We can extend these ideas to general dichotomies of nonbinary patterns. In general, let us denote the maximum number of dichotomies of any set of m ndimensional patterns by hypotheses in H as ΠH (m, n). The number of dichotomies will, of course, depend on the disposition of the m points in the ndimensional space; we take ΠH (m, n) to be the maximum over all possible arrangements of the m points. (In the case of the class of linearly separable functions, the maximum number is achieved when the m points are in general position.) For each class, H, there will be some maximum value of m for which ΠH (m, n) = 2m , that is, for which H shatters the m patterns. This maximum number is called the VapnikChervonenkis (VC) dimension and is denoted by VCdim(H) [Vapnik & Chervonenkis, 1971]. We saw that for the class of linear dichotomies, VCdim(Linear) = (n + 1). As another example, let us calculate the VC dimension of the hypothesis space of single intervals on the real line—used to classify points on the real line. We show an example of how points on the line might be dichotomized by a single interval in Fig. 8.3. The set Ξ could be, for example, {0.5, 2.5,  2.3, 3.14}, and one of the hypotheses in our set would be [1, 4.5]. This hypothesis would label the points 2.5 and 3.14 with a 1 and the points  2.3 and 0.5 with a 0. This
8. Therefore the VC dimension of single intervals on the real line is 2. page 438] gives experimental evidence for the proposition that “ . TLUs with n inputs have a VC dimension of n + 1.4 Some Facts and Speculations About the VC Dimension • If there are a ﬁnite number. . • [Baum.3. But no single interval can classify three points such that the outer two are classiﬁed as 1 and the inner one as 0. 8.3. then we begin to have good generalization. multilayer [neural] nets have a VC dimension roughly equal to their total number of [adjustable] weights. 1994. As soon as we have many more than 2 training patterns on the real line and provided we know that the classiﬁcation function we are trying to guess is a single interval. then: VCdim(H) ≤ log(H) • The VC dimension of terms in n dimensions is n.” . Figure 8. . of hypotheses in H. Since any dichotomy of VCdim(H) or fewer patterns in general position in n dimensions can be achieved by some hypothesis in H. A hypothesis space consisting of conjunctions of these tests (called axisparallel hyperrectangles) has VC dimension bounded by: n ≤ VCdim ≤ 2n • As we have already seen.3: Dichotomizing Points by an Interval The VC dimension is a useful measure of the expressive power of a hypothesis set. • Suppose we generalize our example that used a hypothesis set of single intervals on the real line. THE VAPNIKCHERVONENKIS DIMENSION 117 set of hypotheses (single intervals on the real line) can arbitrarily classify any two points. Our examples have shown that the concept of VC dimension is not restricted to Boolean functions. we must have many more than VCdim(H) patterns in the training set in order that a hypothesis consistent with the training set is suﬃciently constrained to imply good generalization. We allow only one such test per dimension. H. Now let us consider an ndimensional feature space and tests of the form Li ≤ xi ≤ Hi .
5 Any PAC learning algorithm Ω(1/ε lg(1/δ) + VCdim(H)) training patterns.01. The VC dimension is 2 (as shown previously). m ≥ 16.) A hypothesis space H is PAC learnable iﬀ it has ﬁnite VC dimension. Theorem 8. H consistent with the The second of these two theorems improves the bound on the number of training patterns needed for linearly separable functions to one that is linear in n. ε = 0. 756. et al. We state these here without proof. 551 ensures PAC learnability.118 CHAPTER 8.4 A set of hypotheses. 1988]: Theorem 8.3 (Blumer. In our previous example of how many training patterns were needed to ensure PAC learnability of a linearly separable function if n = 50.01. result we would get m ≥ 52. H. 1990]. et al. Theorem 8. the lower must examine at least and upper bounds is 8. As another example of the second theorem.5 To be added. With n = 50. m ≥ (1/ε) max [4 lg(2/δ).. et al.01.01. if there is an algorithm that outputs a hypothesis h training set in polynomial (in m and n) time. Using the Blumer. ε = 0. There is also a theorem that gives a lower (necessary) bound on the number of training patterns required for PAC learning [Ehrenfeucht. 8 VCdim lg(13/ε)].4 VC Dimension and PAC Learning There are two theorems that connect the idea of VC dimension with PAC learning [Blumer. Bibliographical and Historical Remarks . The diﬀerence between O(log(1/ε)VCdim(H)/ε). 748. we obtained m ≥ 173. let us take H to be the set of closed intervals on the real line. and b. COMPUTATIONAL LEARNING THEORY 8. is properly PAC learnable if: a. and δ = 0.. and δ = 0. et al.
ΞR . . while the second (b) seems diﬃcult to partition at all. We will explain shortly various methods for deciding how many clusters there should be and for separating a set of patterns into that many clusters. and their motivation. • Design a classiﬁer based on the labels assigned to the training patterns by the partition. Another type of unsupervised learning involves ﬁnding hierarchies of partitionings or clusters of clusters.1. called clusters. A hierarchical partition is one in which Ξ is 119 . In that setting. One encoding involves a description of each point separately. 1988] discovered a new classiﬁcation of stars based on the properties of infrared sources. One of the MDLbased methods. 9. other. perhaps shorter. . encodings might involve a description of clusters of points together with how each point in a cluster can be described given the cluster it belongs to. . itself.Chapter 9 Unsupervised Learning 9. The ﬁrst set (a) seems naturally partitionable into two classes. we assume that we want to encode a description of a set of points. but the MDL method has been applied with success. et al. and the third (c) is problematic. into a message of minimal length. There are two stages: • Form an Rway partition of a set Ξ of unlabeled training patterns (where the value of R. . may need to be induced from the patterns). We can base some of these methods.1 What is Unsupervised Learning? Consider the various sets of points in a twodimensional space illustrated in Fig. Ξ1 .. on minimumdescriptionlength (MDL) principles. Autoclass II [Cheeseman. Unsupervised learning uses procedures that attempt to ﬁnd natural partitions of patterns. The partition separates Ξ into R mutually exclusive and exhaustive subsets. Ξ. The speciﬁc techniques described in this chapter do not explicitly make use of MDL principles.
For patterns whose features are numeric. 9. We present the (unlabeled) patterns in the training set. Ξ1 . The simplest of these involves deﬁning a distance between patterns. . 9. Ξi . The hierarchical form is best displayed as a tree. iterative clustering method based on distance. There is a simple.2. We show an example of such a hierarchical partition in Fig. . . . the distance measure can be ordinary Euclidean distance between two points in an ndimensional space. . Suppose we have R randomly chosen cluster seekers. . to the algorithm . It can be described as follows. . as shown in Fig. and so on. each set. ΞR . These are points in an ndimensional space that we want to adjust so that they each move toward the center of one of the clusters of patterns. . C1 .3.2 9. R) is divided into mutually exclusive and exhaustive subsets. . CR . Ξ. . 9. UNSUPERVISED LEARNING a) two clusters b) one cluster c) ? Figure 9.1: Unlabeled Patterns divided into mutually exclusive and exhaustive subsets. One application of such hierarchical partitions is in organizing individuals into taxonomic hierarchies such as those used in botany and zoology. . (i = 1. .120 CHAPTER 9. The tip nodes of the tree can further be expanded into their individual pattern elements.2.1 Clustering Methods A Method Based on Euclidean Distance Most of the unsupervised learning methods use a measure of similarity between patterns in order to group them into clusters.
As a cluster seeker’s mass increases it moves less far towards a pattern. Xi . equal to the number of times that it has moved. Suppose each cluster seeker. CLUSTERING METHODS 121 21 21 22 23 = 2 22 11 11 12 = 1 23 12 1 31 31 32 32 = 3 2 3= Figure 9. we ﬁnd that cluster seeker. Intuitively. Cj . that is closest to Xi and move it closer to Xi : Cj ←− (1 − αj )Cj + αj Xi where αj is a learning rate parameter for the jth cluster seeker. it determines how far Cj is moved toward Xi . mj . For each pattern. presented. we might set αj = 1/(1 + mj ) and use the above rule together with mj ←− mj + 1. a cluster seeker is always at the center of gravity (sample mean) of the set of patterns toward which it has so far moved. With this adjustment rule. if a cluster seeker ever gets within some reasonably well clustered set of patterns (and if that cluster seeker is the only one so located).9. Reﬁnements on this procedure make the cluster seekers move less far as training proceeds. it will converge to the center of gravity of that cluster. has a mass.2: A Hierarchy of Clusters onebyone. Cj . . For example.2.
3: Displaying a Hierarchy as a Tree Once the cluster seekers have converged. we seek clusters whose patterns are as close together as possible. by computing its sample variance deﬁned by: V = (1/K) i (Xi − M)2 where M is the sample mean of the cluster. We can measure the badness. can be implemented by a linear machine. In this way.122 CHAPTER 9. Georgy Fedoseevich Voronoi. so we must arrange that our measure of the badness of a partition must increase with the number of clusters. 9. an example of which is shown in Fig.4. V . We would like to partition a set of patterns into clusters such that the sum of the sample variances (badnesses) of these clusters is small. This kind of classiﬁcation. {Xi }. When basing partitioning on distance. the classiﬁer implied by the nowlabeled patterns in Ξ can be based on a Voronoi partitioning of the space (based on distances to the various cluster seekers). the sample variances will all be zero. Of course if we have one cluster for each pattern. UNSUPERVISED LEARNING 1 3 2 11 12 31 32 21 22 23 Figure 9. of a cluster of patterns. which is deﬁned to be: M = (1/K) i Xi and K is the number of points in the cluster. was a Russian mathematician who lived from 1868 to 1909. we can seek a tradeoﬀ between the variances of .
ever falls below some threshold ε. For example.e. say Ci . In this way we can decrease the overall badness of a partition by reducing the number of clusters for comparatively little penalty in increased variance. . In this way the badness of the partition might ultimately decrease by decreasing the total sample variance with comparatively little penalty for the additional cluster seeker. Elaborations of our basic clusterseeking procedure allow the number of cluster seekers to vary depending on the distances between them and depending on the sample variances of the clusters. Cj . Ci and Cj . In distancebased methods. between two cluster seekers. the square root of the variance) of each of the components over the entire training set and normalize the values of the components so that their adjusted standard deviations are equal. One commonly used technique is to compute the standard deviation (i. On the other hand.4: MinimumDistance Classiﬁcation the clusters and the number of them in a way somewhat similar to the principle of minimal description length discussed earlier. The values of the parameters ε and δ are set depending on the relative weights given to sample variances and numbers of clusters.9. it is important to scale the components of the pattern vectors. if the distance. then we can place a new cluster seeker. CLUSTERING METHODS 123 Separating boundaries C1 C3 C2 Figure 9.2.. deﬁnes a cluster whose sample variance is larger than some amount δ. if any of the cluster seekers. The variation of values along some dimensions of the pattern vector may be much diﬀerent than that of other dimensions. dij . at some random location somewhat adjacent to Ci and reset the masses of both Ci and Cj to zero. then we can replace them both by a single cluster seeker placed at their center of gravity (taking into account their respective masses).
As we saw earlier. X. Ci and Cj if (Mi − Mj )2 < ε.) Suppose the largest of these similarities is S(X. . L. providing p(Ci X) is larger than some ﬁxed threshold. . Ci ) = p(x1 Ci )p(x2 Ci ) · · · p(xn Ci )p(Ci ) The p(xj Ci ) can be estimated from the sample statistics of the patterns in the clusters and then used in the above expression. Compute new sample statistics p(x1 Cmerge ).2. compute S(X. We can base an iterative clustering algorithm on this measure of similarity [Mahadevan & Connell. (Recall the linear form that this formula took in the case of binaryvalued components. (Initially. . . . providing the similarity is larger than δ. Thus. . (b) If S(X. δ. Ci . we can deﬁne the sample mean of a cluster. Ci ) the similarity of X to a cluster. Cmerge = Ci ∪ Cj . Ci ) for each cluster. We can decide to which of these clusters some arbitrary pattern. Cnew = {X} and add Cnew to L. to be: Mi = (1/Ki ) Xj Ci Xj where Ki is the number of patterns in Ci . assign X to Cmax . CR . . . Assuming conditional independence of the pattern components. . Cmax ) ≤ δ. into R mutually exclusive and exhaustive clusters. C1 . p(x2 Cmerge ). Ci . the quantity to be maximized is: S(X. . For the next pattern. (a) If S(X. is largest.124 CHAPTER 9. p(x2 Cmax ).2 A Method Based on Probabilities Suppose we have a partition of the training set.) We call S(X. Ξ. p(xn Cmerge ). Go to 3. b. in Ξ. Ci . create a new cluster. X. . and p(Cmerge ) for the merged cluster. should be assigned by selecting the Ci for which the probability. and p(Cmax ) to take the new pattern into account. p(Ci X). these similarities are all zero. we can use Bayes rule and base our decision on maximizing p(XCi )p(Ci ). Go to 3. . Cmax ←− Cmax ∪ {X} Update the sample statistics p(x1 Cmax ). UNSUPERVISED LEARNING 9. 1992]. That is. Cmax ). of patterns. xi . p(xn Cmax ). Begin with a set of unlabeled patterns Ξ and an empty list. Just as before. Cmax ) > δ. of clusters. Merge any existing clusters. . c. we assign X to the cluster to which it is most similar. It can be described as follows: a.
equal to the average of Xi and Xj . there will be a small number of clusters with many patterns in each cluster. These clusters can be organized hierarchically in a binary tree with cluster 9 as root. we form a new cluster. of unlabeled training patterns. A ternary tree could be formed instead if one searches for the three points in Ξ whose triangle deﬁned by those patterns has minimal area. In this case. and so on. Mention “kmeans and “EM” methods.3. otherwise go to 2. C. to that category that maximizes S(X.3 9. we replace Cj and Xi in Ξ by their (appropriately weighted) average and continue. we form a new cluster. the larger the value of ε.9. consisting of the union of Cj and {Xi }. then terminate with the clusters in L. In this case. An example of how this method aggregates a set of two dimensional patterns is shown in Fig.3. C. appropriate scaling of the dimensions is assumed. Designing a classiﬁer based on the patterns labeled by the partitioning is straightforward. (Again. Xi . and a cluster vector. clusters 7 and 8 as the two descendants of the root. If the shortest distance is between two cluster vectors. we form a new cluster. as before and replace the pair of patterns in Ξ by their average. Next we compute the Euclidean distance again between all pairs of points in Ξ. we leave it to the reader to generate a precise algorithm. Cj (representing a cluster. The numbers associated with each cluster indicate the order in which they were formed. 9. Cj ). (The description of this algorithm is based on an unpublished manuscript by Pat Langley. Similarly. we replace Ci and Cj by their (appropriately weighted) average and continue. Ξ. The value of the parameter δ controls the number of clusters. we ultimately terminate with a tree of clusters rooted in the cluster containing all of the points in the original training set. there will be a large number of clusters with few patterns in each cluster. 9. If the sample statistics of the clusters have not changed during an entire iteration through Ξ. C. Ci ). the smaller the number clusters that will be found.) Suppose the smallest distance is between patterns Xi and Xj . If the smallest distance is between pairs of patterns. .) Our description here gives the general idea. C. HIERARCHICAL CLUSTERING METHODS 125 d. eliminate Xi and Xj from Ξ and replace them by a cluster vector. We collect Xi and Xj into a cluster. If δ is high. We ﬁrst compute the Euclidean distance between all pairs of patterns in Ξ. Ci and Cj . C. X. If the shortest distance is between a pattern. For small values of δ.5. Since we reduce the number of points in Ξ by one each time.1 Hierarchical Clustering Methods A Method Based on Euclidean Distance Suppose we have a set. consisting of the union of Ci and Cj . We can form a hierarchical classiﬁcation of the patterns in Ξ by a simple agglomerative method. We assign any pattern.
the probability that we guess the ith component correctly is: probability(guess is vij )pi (vij Ck ) = j j [pi (vij Ck )] 2 The average number of (the n) components whose values are guessed correctly by this method is then given by the sum of these probabilities over all of the components of X: [pi (vij Ck )] i j 2 . Guess that xi = vij with probability pi (vij Ck ). .2 A Method Based on Probabilities A probabilistic quality measure for partitions We can develop a measure of the goodness of a partitioning based on how accurately we can guess a pattern given only what partition it is in. . we can compute the sample statistics p(xi Ck ) which give probability values for each component given the class assigned to it by the partitioning. We use the notation pi (vij Ck ) = probability(xi = vij Ck ). UNSUPERVISED LEARNING 4 1 2 3 7 9 8 5 6 Figure 9.3. . Then. where the index j steps over the domain of that component. Suppose we are given a partitioning of Ξ into R classes. C1 . As before. . CR .5: Agglommerative Clustering 9. Suppose we use the following probabilistic guessing rule about the values of the components of a vector X given only that it is in class k. Suppose each component xi of X can take on the values vij .126 CHAPTER 9.
Finally. P2 = {{a. not quite as high as Z(P1 ). d}}. A similar calculation also gives 2 1/2 for class 2. Summing over the three components gives 3/2. gives the following sample probabilities: p1 (v11 = 1C1 ) = 1 p2 (v21 = 1C1 ) = 1/2 p3 (v31 = 1C1 ) = 1 Summing over the values of the components (0 and 1) gives (1)2 + (0)2 = 1 for component 1. {d}}. and P4 = {{a}. c. {b}. {c. The sample probabilities pi (vi1 = 1) and pi (vi0 = 0) are all equal to 1/2 for each of the three components.6. Let’s evaluate Z values for the following ones: P1 = {a. Summing over the three components gives 2 1/2 for class 1. c}. b. dividing by the number of clusters produces the ﬁnal Z value of this partition. Z(P2 ) = 1 1/4. dividing by the number of clusters produces the ﬁnal Z value of this partition. HIERARCHICAL CLUSTERING METHODS 127 Given our partitioning into R classes. P1 . and (1)2 + (0)2 = 1 for component 3. The ﬁrst. 9.9. (1/2)2 + (1/2)2 = 1/2 for component 2. the goodness measure. Finally. There are several diﬀerent partitionings. The second partition. Summing over the values of the components (0 and 1) gives (1/2)2 + (1/2)2 = 1/2. so this method of evaluating partitions would favor placing all patterns in a single cluster. d}}. Similar calculations yield Z(P3 ) = 1 and Z(P4 ) = 3/4. Averaging over all of the clusters (there is just one) also gives 3/2. Z(P1 ) = 3/2. Averaging over the two clusters also gives 2 1/2.3. puts all of the patterns into a single cluster. {b. . P2 . G. we divide it by R to get an overall “quality” measure of a partitioning: Z = (1/R) k p(Ck ) i j [pi (vij Ck )] 2 We give an example of the use of this measure for a trivially simple clustering of the four threedimensional patterns shown in Fig. d}. b}. of this partitioning is the average of the above expression over all classes: G= k p(Ck ) i j [pi (vij Ck )] 2 where p(Ck ) is the probability that a pattern is in class Ck . {c}. P3 = {{a. In order to penalize this measure for having a large number of classes.
η. calculate the best host for Xi . UNSUPERVISED LEARNING x3 d b c a x2 x1 Figure 9. The tips of the tree will contain singleton sets. For each of the successors of µ (including the empty successor!). d. We arrange that at all times during the process every nonempty node in the tree has (besides any other successors) exactly one empty successor.6: Patterns in 3Dimensional Space An iterative method for hierarchical clustering Evaluating all partitionings of m patterns and then selecting the best would be computationally intractable. sample statistics are used to update the Z values whenever a pattern is placed at a node. We start with a tree whose root node contains all of the patterns in Ξ and a single empty successor node. b. c. are labeled by mutually exclusive and exhaustive subsets of the pattern set labelling node η. The following iterative method is based on a hierarchical clustering procedure called COBWEB [Fisher. 1987]. The algorithm is as follows: a. the root node contains all of the patterns in Ξ. At the end of the process. The procedure grows a tree each node of which is labeled by a set of patterns. The method uses Z values to place patterns at the various nodes. A best host is determined by tentatively placing Xi in one of the successors and calculating the resulting Z value for each . terminate).128 CHAPTER 9. In general. Select a pattern Xi in Ξ (if there are no more patterns to select. Set µ to the root node. the successors of a node. The successors of the root node will contain mutually exclusive and exhaustive subsets of Ξ.
the COBWEB procedure incorporates node merging and splitting. and go to 4. Rather than try all pairs to merge. singleton (tip) node. and the two nodes that were merged are installed as successors of the new node. If the best host is a nonempty. create one successor node of η containing the singleton pattern that was in η. Issue Toxic Waste Budget Cuts SDI Reduction Contra Aid LineItem Veto MX Production Class 1 yes yes no yes yes yes Class 2 no no yes no no no . generate an empty sibling node of η. nonsingleton node. η. If the best host is a nonempty. After the clusters were established. In the ﬁrst. set µ to η. create another successor node of η containing Xi . η. the program attempted to ﬁnd two categories (we will call them Class 1 and Class 2) of United States Senators based on their votes (yes or no) on six issues. we place Xi in η. To make the ﬁnal classiﬁcation tree less order dependent. When such a merging improves the Z value. HIERARCHICAL CLUSTERING METHODS 129 one of these ways of accomodating Xi . If the best host is an empty node.3. f. a new node containing the union of the patterns in the merged nodes replaces the merged nodes. a good heuristic is to attempt to merge the two best hosts. we place Xi in η. we place Xi in η. Example results from COBWEB We mention two experiments with COBWEB. The best host corresponds to the assignment with the highest Z value. Node merging: It may happen that two nodes having the same parent could be merged with an overall increase in the quality of the resulting classiﬁcation performed by the successors of that parent. g. This process is rather sensitive to the order in which patterns are presented. generate an empty successor node of η. create an empty successor node of η. create empty successor nodes of the new nonempty successors of η. η. Node splitting: A heuristic for node splitting is to consider replacing the best host among a group of siblings by that host’s successors. This operation is performed only if it increases the Z value of the classiﬁcation performed by a group of siblings. These are shown in the table below.9. e. and go to 2. and go to 2. the majority vote in each class was computed.
9. Bibliographical and Historical Remarks . the program attempted to classify soybean diseases based on various characteristics. COBWEB grouped the diseases in the taxonomy shown in Fig. UNSUPERVISED LEARNING In the second experiment.7: Taxonomy Induced for Soybean Diseases 9.7. N0 soybean diseases N1 Diaporthe Stem Canker N2 Charcoal Rot N3 N31 Rhizoctonia Rot N32 Phytophthora Rot Figure 9.130 CHAPTER 9.4 To be added.
and that is the problem we consider here. X. . i = 1. we consider problems in which we wish to learn to predict the future value of some quantity. X2 . the patterns occur in temporal sequence. we desire to make a sequence of predictions about the value of z at some ﬁxed time. The components of Xi are features whose values are available at time. .. X1 . Sutton [Sutton.2 Supervised and TemporalDiﬀerence Methods A training method that naturally suggests itself is to use the actual value of z at time m + 1 (once it is known) in a supervised learning procedure using a 131 . . Xi+1 . In one. . We distinguish two kinds of prediction problems. . we desire to predict the value of z at time t = i + 1 based on input Xi for every i. . For example. Xm . m.1 Temporal Patterns and Prediction Problems In this chapter. In the other kind of prediction problem. based on each of the Xi . . In many of these problems. based on measurements taken every day before New Year’s. and are generated by a dynamical process. Xi . In multistep prediction.Chapter 10 TemporalDiﬀerence Learning 10. . say z. t = i. from an ndimensional input pattern. we might wish to make a series of predictions about some aspect of the weather on next New Year’s Day. multistep prediction. . For example. we might expect that the prediction accuracy should get better and better as i increases toward m. 10. 1988] has called this latter problem.. . we might wish to predict some aspects of tomorrow’s weather based on a set of measurements made today. say t = m + 1.
. {X1 . we change W as follows: m W ←− W + i=1 (∆W)i Whenever we are attempting to minimize the squared error between z and f (Xi . to be made to W. . f (X). we seek to learn a function. Such methods involve what is called temporaldiﬀerence (TD) learning. X2 . . such that f (Xi ) is as close as possible to z for each i. For supervised learning. (∆Wi ).. W) = X • W. Then: (∆W)i = c(z − fi )Xi An interesting form for (∆W)i can be developed if we note that m (z − fi ) = k=i (fk+1 − fk ) where we deﬁne fm+1 = z. Then. We assume that our prediction. TEMPORALDIFFERENCE LEARNING sequence of training patterns. taking into account the weight changes for each pattern in a sequence all at once after having made all of the predictions with the old weight vector. . . by deﬁnition. f . That is. ∂wii .132 CHAPTER 10. and the learning rule (whatever it is) computes the change. . Xi+1 . the weightchanging rule for each pattern is: (∆W)i = c(z − fi ) ∂fi ∂W where c is a learning rate parameter. The WidrowHoﬀ rule results when f (X.. . W. and ∂W is. Typically. . consisting of several such sequences.) The reader will recall that we used an equivalent expression for (∆W)i in deriving the backpropagation formulas used in training multilayer neural networks. we write f (X. f (Xi . ∂wi ) in which the wi are the individual components of W. depends on a vector of modiﬁable weights. . . We will show that a method that is better than supervised learning for some important problems is to base learning on the diﬀerence between f (Xi+1 ) and f (Xi ) rather than on the diﬀerence between z and f (Xi ). ∂fi at time t = i. the vector of partial derivatives ∂f ∂f ∂f ( ∂wi . Xi . W). . W) is computed and compared to z. Ξ. we would need a training set. 1 n ∂fi (The expression ∂W is sometimes written W fi . the prediction f (Xi . . To make that dependence explicit. W). we consider procedures of the following type: For each Xi . fi is our prediction of z. Xm }. Substituting in our formula for (∆W)i yields: (∆W)i = c(z − fi ) ∂fi ∂W . W) by gradient descent. . .
we have various degrees of unsupervised learning. but as λ → 0.10. the error is the diﬀerence between successive predictions. When λ = 1. Here.2. the λ term gives exponentially decreasing weight to diﬀerences later in time than t = i. the temporal diﬀerence form of the WidrowHoﬀ rule is: m (∆W)i = cXi k=i (fk+1 − fk ) One reason for writing (∆W)i in temporaldiﬀerence form is to permit an interesting generalization as follows: (∆W)i = c ∂fi ∂W m λ(k−i) (fk+1 − fk ) k=i where 0 < λ ≤ 1. only the error term is diﬀerent. W) = X • W. we have the same rule with which we began—weighting all diﬀerences equally. SUPERVISED AND TEMPORALDIFFERENCE METHODS 133 =c ∂fi ∂W m (fk+1 − fk ) k=i In this form. in which the prediction function strives to make each prediction more like successive ones (whatever they might be). For λ < 1. It is interesting to compare the two extreme cases: For TD(0): (∆W)i = c(fi+1 − fi ) For TD(1): (∆W)i = c(z − fi ) ∂fi ∂W ∂fi ∂W Both extremes can be handled by the same learning mechanism. the method is called TD(λ). Intermediate values of λ take into account diﬀerently weighted diﬀerences between future pairs of successive predictions. In the case when f (X. we use the diﬀerences between successive predictions—thus the phrase temporaldiﬀerence (TD) learning. We shall soon see that these unsupervised procedures result in better learning than do the supervised ones for an important class of problems. instead of using the diﬀerence between a prediction and the value of z. sensitive to the ﬁnal value of z provided by the teacher. the error is the diﬀerence between the ﬁnally revealed value of z and the prediction. . and in TD(1). With the λ term. In TD(0). Only TD(1) can be considered a pure supervised learning procedure. we weight only the (fi+1 − fi ) diﬀerence.
134 CHAPTER 10. as earlier. we want to use an expression of the form W ←− W+ we see that we can write: i (∆W)i = c(fi+1 − fi ) k=1 i λ(i−k) ∂fk ∂W ∂fk Now. TEMPORALDIFFERENCE LEARNING 10. if we let ei = k=1 λ(i−k) ∂W .3 Incremental Computation of the (∆W)i m We can rewrite our formula for (∆W)i . we can develop a computationally eﬃcient recurrence equation for ei+1 as follows: i+1 ei+1 = k=1 λ(i+1−k) ∂fk ∂W = ∂fi+1 + ∂W i λ(i+1−k) k=1 ∂fk ∂W . namely (∆W)i = c ∂fi ∂W λ(k−i) (fk+1 − fk ) k=i to allow a type of incremental computation. If. First we write the expression for the weight change rule that takes into account all of the (∆W)i : m W ←− W + i=1 c ∂fi ∂W m λ(k−i) (fk+1 − fk ) k=i Interchanging the order of the summations yields: m k W ←− W + k=1 c i=1 λ(k−i) (fk+1 − fk ) ∂fi ∂W m k =W+ k=1 c(fk+1 − fk ) i=1 λ(k−i) ∂fi ∂W Interchanging the indices k and i ﬁnally yields: m i W ←− W + i=1 c(fi+1 − fi ) k=1 λ(i−k) ∂fk ∂W m i=1 (∆W)i .
In Fig. . this equation can be computed incrementally. because each (∆W)i depends only on a pair of successive predictions and on the ∂fi [weighted] sum of all past values for ∂W . sequences of vectors. . 1988. 10. but the quote applies equally well to this one): “. Some sample sequences are shown in the ﬁgure. it is equally likely that the sequence terminates with z = 1 or that the next vector is XE . it is equally likely that the sequence terminates with z = 0 or that the next vector is XC . sequences of temporally presented patterns contain important information that is ignored by a conventional supervised method such as the WidrowHoﬀ rule. because it is no longer necessary to individually remember ∂fi all past values of ∂W .1.4.” 10. page 19] gives an interesting example involving a random walk. are generated as follows: We start with vector XD . the next one after that is equally likely to be one of the vectors adjacent to XC (or XE ). the next vector in the sequence is equally likely to be one of the adjacent vectors in the diagram. we obtain: (∆W)i = c(fi+1 − fi )ei where: e1 = ∂f1 ∂W e2 = etc. page 15] (about a diﬀerent equation.10. when XF is in the sequence. If the next vector is XC (or XE ). but they always start with XD . ∂f2 + λe1 ∂W Quoting Sutton [Sutton. Similarly. AN EXPERIMENT WITH TD METHODS 135 = ∂fi+1 + λei ∂W Rewriting (∆W)i in these terms.4 An Experiment with TD Methods TD prediction methods [especially TD(0)] are well suited to situations in which the patterns are generated by a dynamic process. In that case. which we repeat here. . X. This saves substantially on memory. Thus the sequences are random. When XB is in the sequence. 1988. Sutton [Sutton.
The experimental setup was as follows: ten random sequences were generated using the transition probabilities. W. f (XD ) = w3 . Weight vector increments. f (XE ) = w4 . Given the ﬁve diﬀerent values that X can take on. This process was repeated over and over (using the same training sequences) until (quoting Sutton) “the procedure no longer produced any signiﬁcant changes in the weight vector. For small c. f (XF ) = w5 . W) = X • W. The learning problem is to ﬁnd a weight vector. For his experiments with this process. we have the following predictions: f (XB ) = w1 . Given a set of sequences generated by this process as a training set.) After training. TEMPORALDIFFERENCE LEARNING 1 0 0 0 0 XB z=0 0 1 0 0 0 XC 0 0 1 0 0 XD 0 0 0 1 0 XE 0 0 0 0 1 XF z=1 Typical Sequences: XDXCXDXEXF 1 XDXCXBXCXDXEXDXEXF 1 XDXEXDXCXB 0 Figure 10. (∆W)i . transitions from state i to state j occur with probabilities that depend only on i and j. the weight vector always converged in this way.136 CHAPTER 10. we want to be able to predict the value of z for each X in a test sequence. Sutton used a linear predictor. were computed after each pattern presentation but no weight changes were made until all ten sequences were presented. and this sum was used to change the weight vector to be used for the next pass through the ten sequences. these predictions will be compared with the optimal ones—given the transition probabilities. The weight vector increments were summed after all ten sequences were presented. that is f (X. We assume that the learning system does not know the transition probabilities. that minimizes the meansquared error between z and the predicted value of z. where wi is the ith component of the weight vector. Each of these sequences was presented in turn to a TD(λ) method for various values of λ. f (XC ) = w2 . (Note that the values of the predictions are not limited to 1 or 0—even though z can only have one of those values—because we are minimizing meansquared error.1: A Markov Process This random walk is an example of a Markov process. .
It is well known that. 2/3.) Error using best c 0.1 0.3 0. (For each data point. 1988) Figure 10. 1/3. The rootmeansquared diﬀerences between the best learned predictions (over all c) and these optimal ones are plotted in Fig.10.0 (Adapted from Sutton.10 WidrowHoff TD(1) TD(0) 0. the weight vector always converged to the same vector. the WidrowHoﬀ procedure minimizes the RMS error between its predictions and the actual outcomes in the training set ([Widrow & Stearns. 20. independent of its initial value.2: Prediction Errors for TD(λ) Notice that the WidrowHoﬀ procedure does not perform as well as other versions of TD(λ) for λ < 1! Quoting [Sutton. respectively.7 0. XC . 1988. XF .9 1. 1/2.) After convergence.” (Even though. XE .2 for seven diﬀerent values of λ. How can it be that this optimal method peformed worse than all the TD methods for λ < 1? The answer is that the WidrowHoﬀ procedure only minimizes error on the training set.18 0. page 21]: “This result contradicts conventional wisdom. for ﬁxed.0 0. 1985]). These optimal predictions are simply p(z = 1X). small c. and 5/6 for XB . 10. AN EXPERIMENT WITH TD METHODS 137 and always to the same ﬁnal value [for 100 diﬀerent training sets of ten random sequences]. it does not necessarily minimize error for future experience. XD .20 0. p. the standard error is approximately σ = 0. it might converge to a somewhat diﬀerent vector for diﬀerent values of c.4.01. We can compute these probabilities to be 1/6.14 0. the predictions made by the ﬁnal weight vector are compared with the optimal predictions made using the transition probabilities. under repeated presentations.5 0.12 0. [Later] we prove that in fact it is linear TD(0) that converges to what can be considered the optimal estimates for .16 0.
1 (Sutton. We state some theorems here without proof. 1988) For any absorbing Markov chain. Wi ) and fi = f (Xi . To make the method truly incremental (in analogy with weight updating rules for neural nets). The obvious extension is: i Wi+1 ←− Wi + c(fi+1 − fi ) k=1 λ(i−k) ∂fk ∂W where fi+1 is computed before making the weight change. the predictions of linear TD(0) (with weight updates after each sequence) converge in expected value to the optimal (maximum likelihood) predictions of the true process. But that would make fi = f (Xi . fi+1 = f (Xi+1 . that is. Instead. and for any linearly independent set of observation vectors {Xi } for the nonterminal states. and such a rule would make the prediction diﬀerence.6 IntraSequence Weight Updating Our standard weight updating rule for TD(λ) methods is: m i W ←− W + i=1 c(fi+1 − fi ) k=1 λ(i−k) ∂fk ∂W where the weight update occurs after an entire sequence is observed.5 Theoretical Results It is possible to analyze the performance of the linearprediction TD(λ) methods on Markov processes. This version of the rule has been used in practice with excellent results. for every pair of predictions. page 24. (Also see [Dayan & Sejnowski. namely (fi+1 − fi ). 1992] has extended the result of Theorem 9. the variance of the predictions will approach 0 also. fi+1 = f (Xi+1 . . there exists an ε > 0 such that for all positive c < ε and for any initial weight vector. Even though the expected values of the predictions converge. Wi−1 ). the predictions themselves do not converge but vary around their expected values depending on their most recent experience. Theorem 10. sensitive both to changes in X and changes in W and could lead to instabilities.1 to TD(λ) for arbitrary λ between 0 and 1. it would be desirable to change the weight vector after every pattern presentation. 1994]. TEMPORALDIFFERENCE LEARNING matching future experience—those consistent with the maximumlikelihood estimate of the underlying Markov process. Dayan [Dayan. Wi ). we modify the rule so that. Sutton conjectures that if c is made to approach 0 as training progresses.138 CHAPTER 10.) 10. Wi ).” 10.
10.. The weight changing rule for the ith weight vector in the jth layer of weights has the same form as before. its value would be closer to fi+1 as desired. This change has a direct eﬀect only on the expression for δ (k) which becomes: Wi+1 = Wi + c(fi+1 − fi ) δ (k) = 2(f (k) − f (k) )f (k) (1 − f (k) ) where f (k) and f (k) are two successive outputs of the network. . (fi+1 − fi ).) (b) fi+1 ←− Xi+1 • W (c) di+1 ←− fi+1 − fi (d) W ←− W + c di+1 Xi (If fi were computed again with this changed weight vector. here also it is assumed that f (k) and f (k) are computed using the same weights and then the weights are changed. INTRASEQUENCE WEIGHT UPDATING For TD(0) and linear predictors. For i = 1. W. m. . the rule is: Wi+1 = Wi + c(fi+1 − fi )Xi The rule is implemented as follows: a. namely: Wi where the δi (j) (j) ←− Wi (j) + cδi X(j−1) (j) are given recursively by: mj+1 δi (j+1) wil (j) = fi (1 − fi ) l=1 (j) (j) δl (j+1) wil (j+1) and is the lth component of the ith weight vector in the (j + 1)th layer of weights.) The linear TD(0) method can be regarded as a technique for training a very simple network consisting of a single dot product unit (and no threshold or sigmoid function). b. namely (d − f (k) ). For TD(0) we change the network weights according to the expression: ∂fi ∂W The only change that must be made to the standard backpropagation weightchanging rule is that the diﬀerence term between the desired output and the output of the unit in the ﬁnal (kth) layer. Of course. In the next section we shall see an interesting example of this application of TD learning.6. Initialize the weight vector. must be replaced by a diﬀerence term between successive outputs.. do: 139 (a) fi ←− Xi • W (We compute fi anew each time through rather than use the value of fi+1 the previous time through.. TD methods can also be used in combination with backpropagation to train neural networks. arbitrarily.
140 CHAPTER 10. {X}. TEMPORALDIFFERENCE LEARNING 10. of new board positions. 1 no. and its coding is as shown in Fig.. 1992] learns to play backgammon by training a neural network via temporaldiﬀerence methods.1.. Each member of this set is evaluated by the network.5. 10. That is.. of white on cell 1 2 3 #>3 estimated payoff: d = p1 + 2p2 p3 2p4 estimated probabilities: p1 = pr(white wins) p2 = pr(white gammons) 2 x 24 cells . The structure of the neural net. and who moves up to 40 hidden units 198 inputs hidden and output units are sigmoids learning rate: c = 0. and the pi are the actual probabilities of the various outcomes as deﬁned in the ﬁgure. p3 = pr(black wins) p4 = pr(black gammons) 4 output units no.7 An Example Application: TDgammon A program called TDgammon [Tesauro. off board.3: The TDgammon Network TDgammon learned by using the network to select that move that results in the best predicted payoﬀ. .. where the actual payoﬀ is deﬁned to be df = p1 + 2p2 − p3 − 2p4 . and the one with the largest .3. on bar. initial weights chosen randomly between 0. Figure 10. at any stage of the game some ﬁnite set of moves is possible and these lead to the set. The network is trained to minimize the error between actual payoﬀ and estimated payoﬀ.5 and +0.
recall that for TD(0). Tesauro said: “It appears to be the strongest program ever seen by this author. for input Xt tended toward the expected ﬁnal payoﬀ. The weight adjustment procedure combines temporaldiﬀerence (TD(λ)) learning with backpropagation. The latter case is the same as the WidrowHoﬀ rule. λ = 0. and ∂W is the gradient of dk in this weight space. such as that of TDgammon. and c = 0. for all t.000 games against a neural network trained using expert moves.8 Bibliographical and Historical Remarks To be added. The move is made.” 10. given that input. for all t. TDgammon (with 40 hidden units.2% of 10. the network would be trained so that. For TD(1). BIBLIOGRAPHICAL AND HISTORICAL REMARKS 141 predicted payoﬀ is selected if it is white’s move (and the smallest if it is black’s). its output. After about 200. and the network weights are adjusted to make the predicted payoﬀ from the original position closer to that of the resulting position. If dt is the network’s estimate of the payoﬀ at time t (before a move is made). Commenting on a later version of TDgammon. its output. dt . for input Xt tended toward its expected output.7. the network would be trained so that.1) won 66. for input Xt+1 . the weight changes for the weight vectors in each layer can be expressed in the usual manner. (For a layered. dt+1 .000 games against SUN Microsystems Gammontool and 55% of 10. incorporating special features as inputs. dt .000 games the following results were obtained. the weight adjustment rule is: t ∆Wt = c(dt+1 − dt ) k=1 λt−k ∂dk ∂W ∂dk where Wt is a vector of all weights in the network at time t. .10.) To make the special cases clear. and dt+1 is the estimate at time t + 1 (after a move is made). df .8. feedforward network.
TEMPORALDIFFERENCE LEARNING .142 CHAPTER 10.
of actions to perform. How should it choose its actions so as to maximize its rewards over the long run? To maximize rewards. For the moment. in fact. The situation is as shown in Fig. of states. The learner must ﬁnd the policy by trial and error. ri . will use the notation X to refer to the state of the environment as well as to the input vector. and the cycle repeats. S. That is. which informs the robot about which state the environment is in. that maps input vectors to actions in such a way that maximizes rewards accumulated over time. and. π(X). the action taken at that time is ai . from the environment. and the expected reward. how actions lead to rewards. that is ri = r(Xi . The new state results in the robot perceiving a new input vector.Chapter 11 DelayedReinforcement Learning 11. We assume a discrete time model. We assume that the robot’s sensory apparatus constructs an input vector. Suppose (as an extreme case) that it has no idea about the eﬀects of its actions.” which it occasionally receives. Along with its sensory inputs are “rewards.1. The learner’s goal is to ﬁnd a policy. We formalize the problem in the following way: The robot exists in an environment consisting of a set. and in particular. it doesn’t know how acting will change its sensory inputs. we will assume that the mapping from states to vectors is onetoone. it will need to be able to predict how actions change inputs. received at t = i depends on the action taken and on the state. the input vector at time t = i is Xi . When presented with an input vector. A. This type of learning is called reinforcement learning. 143 .1 The General Problem Imagine a robot that exists in an environment in which it can sense and act. it has no initial knowledge of the eﬀects of its actions. ai ). X. the robot decides which action from a set. Performing the action produces an eﬀect on the environment—moving it to a new state. 11.
In Fig. 11.1). Let’s suppose that whenever the robot lands in the goal cell and gets its reward. it is capable of four actions. respectively. right. or left. A policy for our robot is a speciﬁcation of what action to take for every one of its inputs. the next input to the robot is still (1. we show an optimal policy displayed in this manner. s. it receives a reward of +10.1: Reinforcement Learning 11. n.2 is often used to illustrate reinforcement learning. One way of displaying a policy for our gridworld robot is by an arrow in each cell indicating the direction the robot should move when in that cell. w moving the robot one cell up. In this chapter we will describe methods for learning optimal policies based on reward values received by the learner. 11. The robot receives input vector (x1 .144 CHAPTER 11. . DELAYEDREINFORCEMENT LEARNING (state) Xi Learner (reward) (action) ri Environment ai Figure 11. For example. x2 ) telling it what cell it is in. for every one of the cells in the grid.2 An Example A “grid world. that is. it is immediately transported out to some random cell.3). e. If the robot lands in the cell marked G (for goal). and the robot chooses action w. move right. It is rewarded one negative unit whenever it bumps into the wall or into the blocked cells. down.” such as the one shown in Fig. a component of such a policy would be “when in cell (3. and the quest for reward continues. Imagine a robot initially in cell (2. if the input to the robot is (1.” An optimal policy is a policy that maximizes longterm reward.3).3) and it receives a reward of −1. For example.3.
X. is taken to be γ i ri . we will want to maximize expected future reward and would deﬁne V π (X) as: ∞ V π (X) = E i=0 γ i ri π(X) In either case. the probability that action a in state Xi will lead to state Xj is given by a transition probability p[Xj Xi . are random variables and in which the eﬀects of actions on environmental states are random. ri . occuring i time units in the future. Then the total reward accumulated over all time steps by policy π beginning in state X is: ∞ V π (X) = i=0 γ i ri π(X) One reason for using a temporal discount factor is so that the above sum will be ﬁnite. In general. ri . TEMPORAL DISCOUNTING AND OPTIMAL POLICIES 145 8 7 6 5 4 3 2 1 G R 1 2 3 4 5 6 7 Figure 11. In Markovian environments. we want to consider the case in which the rewards. . one often assumes that rewards in the distant future are not as valuable as are more immediate rewards.2: A Grid World 11.3 Temporal Discounting and Optimal Policies In delayed reinforcement learning. Suppose π(X) we have a policy π(X) that maps input vectors into actions. we call V π (X) the value of policy π for input X. for example. This preference can be accomodated by a temporal discount factor.11. and let ri be the reward that will be received on the ith time step after one begins executing policy π starting in state X. The present value of a reward.3. a]. 0 ≤ γ < 1. An optimal policy is one that maximizes V π (X) for all inputs. Then.
For an optimal policy. π(X)] = the probability that the environment transitions to state X when we execute the action prescribed by π in state X. a]V π (X ) ∗ The theory of dynamic programming (DP) [Bellman. we have the famous “optimality equation:” V π (X) = max r(X. π(X)] = the expected immediate reward received when we execute the action prescribed by π in state X. the value of state X under policy π is the expected value of the immediate reward received when executing the action recommended by π plus the average value (under π) of all of the states accessible from X. π(X)] + γ X p[X X. π ∗ (and no others!).3: An Optimal Policy in the Grid World If the action prescribed by π taken in state X leads to state X (randomly according to the transition probabilities). 1983] assures us that there is at least one optimal policy. that satisﬁes this equation. Ross. V π (X) = the value of state X under policy π. then we can write V π (X) in terms of V π (X ) as follows: V π (X) = r[X. π ∗ . DELAYEDREINFORCEMENT LEARNING 8 7 6 5 4 3 2 1 G R 1 2 3 4 5 6 7 Figure 11. In other words. a) + γ a X ∗ p[X X. 1957.146 CHAPTER 11. r[X. π(X)]V π (X ) where (in summary): γ = the discount factor. DP . and p[X X.
X. If we knew ∗ the transition probabilities. π stand for the policy that chooses action a once. then this adjustment helps to make the two estimates more consistent. of course. and V π (X) for all X and a. asynchronous dynamic programming. a]V π (X ). and thereafter chooses actions according to policy π. Assuming that Vi (X ) is a good estimate for i i Vi (Xi ). and policy iteration. which state. that is. a) + γ a X p[X X. Providing that 0 < ci < 1 and that we visit each state inﬁnitely often. QLEARNING ∗ 147 also provides methods for calculating V π (X) and at least one π ∗ . a]V π (X ) ∗ But. X resulted. Suppose this subsequent state having the highest estimated ˆ value is Xi .π (X) . so we have to ﬁnd a method that eﬀectively learns them. Let a. our input on the ith ˆ step is Xi ). We deﬁne: Qπ (X. X. then it would be easy to implement an optimal policy. if we knew for every state. π ∗ (X) = arg max r(X. If we had a model of actions. an estimated value V (X) to every state. ˆ We see that this adjustment moves the value of Vi (Xi ) an increment (dependˆ ˆ ing on ci ) closer to ri + γ Vi (X ) . Value iteration works as follows: We begin ˆ by assigning. We then select that action a that maximizes the estimated value of the predicted subsequent state. and that the estimated value of state Xi on the ith step is Vi (Xi ). assuming that we know the average rewards and the transition probabilities.11. randomly. Discuss synchronous dynamic programming.4. then we could use a method called value iteration to ﬁnd an optimal policy. a) + γ X p[X X. We would simply select ∗ that a that maximizes r(X. 1989] has proposed a technique that he calls incremental dynamic programming. the average rewards. we are assuming that we do not know these average rewards nor the transition probabilities. ˆ = Vi−1 (X) otherwise. That is. this process of value iteration will converge to the optimal values. a) = V a. Then we update the estimated value. suppose we are at state Xi (that is. On the ith step of the process. of state Xi as follows: ˆ ˆ ˆ Vi (X) = (1 − ci )Vi−1 (X) + ci ri + γ Vi−1 (Xi ) if X = Xi .4 QLearning Watkins [Watkins. and action a. 11. Vi (Xi ).
and . We quote (with minor notational changes) from [Watkins & Dayan. a) is the average value of the immediate reward received when we execute action a in state X. 1992. a. a) + γE Qπ (X . DELAYEDREINFORCEMENT LEARNING Then the optimal value from state X is given by: V π (X) = max Qπ (X. a) a ∗ Note that if an action a makes Qπ (X. the agent: • observes its current state Xi . a) + γE[V π (X )] where r(X. a) larger than V π (X). The optimal policy is given by: π ∗ (X) = arg max Qπ (X. a) = max r(X. a) a ∗ ∗ This equation holds only for an optimal policy. a) a ∗ Watkins’ proposal amounts to a TD(0) method of learning the Q values. • observes the subsequent state Xi . then we could implement an optimal policy simply by selecting that ∗ action that maximized r(X. a) + γE Qπ (X . and states. if we had the optimal Q values (for all a and X). π ∗ . a) = r(X.148 CHAPTER 11. In the ith episode. • receives an immediate reward ri . For an optimal policy (and no others). That is. Suppose action a in state X leads to state X . • selects [using the method described below] and performs an action ai . a) a ∗ ∗ for all actions. then we can improve π by changing it so that π(X) = a. we have another version of the optimality equation in terms of Q values: Qπ (X. a) + γE Qπ (X . Now. X. the agent’s experience consists of a sequence of distinct stages or episodes. π ∗ (X) = arg max r(X. Then using the deﬁnitions of Q and V . a) . it is easy to show that: Qπ (X. page 281]: “In QLearning. Making such a change is the basis for a powerful learning rule that we shall describe shortly.
QLEARNING • adjusts its Qi−1 values using a learning factor ci . whereas Q learning does not. according to: Qi (X. . then the learning procedure described above is exactly a TD(0) method of learning how to predict these Q values. the Q values computed by this learning procedure converge to optimal ones (that is. a) otherwise. = Qi−1 (X. using a model of the eﬀects of actions. the agent always selects that action that maximizes Qi (X. we have: β . V (X ) is the maximum (over all actions) of the Q value of the state next reached when action a is taken from state X.11. And that Q value is adjusted so that it is closer (by an amount determined by ci ) to the sum of the immediate reward plus the discounted maximum (over all actions) of the Q values of the state just entered. Q0 (X. The initial Q values. a) = (1 − ci )Qi−1 (X. Q learning strengthens the usual TD methods. . We deﬁne ni (X. however.” Using the current Q values. is adjusted to equal r + γV (X ). A convenient notation (proposed by [Schwartz. Watkins and Dayan [Watkins & Dayan. a). under certain conditions. a) is the new Q value for input X and action a. a). 1992] prove that. because TD (applied to reinforcement problems using value iteration) requires a onestep lookahead. for all states and actions are assumed given. Then. . a).4. Qi (X. r is the immediate reward when action a is taken in response to input X. a). a) as the index (episode number) of the ith time that action a is tried in state X. If we imagine the Q values to be predictions of ultimate (inﬁnite horizon) total reward. and β is the fraction of the way toward which the new Q value. 1993]) for representing the change in Q value is: Q(X. a) ←− r + γV (X ) where Q(X. a) + ci [ri + γVi−1 (Xi )] if X = Xi and a = ai . Note that only the Q value corresponding to the state just exited and the action just taken is adjusted. Q(X. where Vi−1 (X ) = max [Qi−1 (X . b)] b 149 is the best the agent thinks it can do from state X . to ones on which an optimal policy can be based).
1994]. DELAYEDREINFORCEMENT LEARNING Theorem 11. learning rates 0 ≤ cn < 1. and Extensions of QLearning An Illustrative Example The Qlearning procedure requires that we maintain a table of Q(X. n Again. a) corresponds to the Q values of an optimal policy. with probability 1. no method could be guaranteed to ﬁnd an optimal policy under weaker conditions. page 281]: “The most important condition implicit in the convergence theorem .5. Q learning is best thought of as a stochastic approximation method for calculating the Q values.” The relationships among Q learning. and control are very well described in [Barto. is that the sequence of episodes that forms the basis of learning must include an inﬁnite number of episodes for each starting state and action.a) = ∞. . that the episodes need not form a continuous sequence—that is the X of one episode need not be the X of the next episode. Although the deﬁnition of the optimal Q values for any state depends recursively on expected values of the Q values for subsequent states (and on the expected values of rewards). a) → Q∗ (X. Limitations.1 Discussion. In the grid world that we described earlier.a) 2 <∞ for all X and a. Note. such a table would not be excessively large. a) as n → ∞. and ∞ ∞ cni (X. i=0 i=0 cni (X. & Singh. and given bounded rewards rn  ≤ R. 11. a) values for all stateaction pairs. a portion of such an intial table might be as follows: . under the stochastic conditions of the theorem. Instead. where n Q∗ (X. This may be considered a strong condition on the way states and actions are selected—however. We might start with random entries in the table. for all X and a.150 CHAPTER 11. . 1992. these values are approximated by iterative sampling using the actual stochastic mechanism that produces successor states.5 11. no expected values are explicitly computed by the procedure.1 (Watkins and Dayan) For Markov problems with states {X} and actions {a}. then Qn (X. dynamic programming. Bradtke. however. we quote from [Watkins & Dayan.
DISCUSSION.5 and γ = 0.3) (2. Typically. called the bucket brigade algorithm. as in this example. The general problem of associating rewards with stateaction pairs is called the temporal credit assignment problem—how should credit for a reward be apportioned to the actions leading up to it? Q learning is. 3). w) is adjusted from 7 to 5. One can imagine that better and better approximations to the optimal Q values gradually propagate back from states producing rewards toward all of the other states that the agent frequently visits.5. Notice that an optimal policy might not be discovered if some cells are not visited nor some actions not tried frequently enough. We should note that the learning problem faced by our gridworld robot . has been proposed by [Holland.3) (2. the Q values have to work their way outward from these rewarding states. and. and the learning mechanism attempts to make the value of Q((2. the Q value of Q((2. AND EXTENSIONS OF QLEARNING151 X (2. The learning problem faced by the agent is to associate speciﬁc actions with speciﬁc input patterns.3) is 5. With a learning rate parameter c = 0. LIMITATIONS.3) (2. With random Q values to begin. even then. 3). The agent’s model of the world is obtained by Q learning in its actual world. The maximum Q value in cell (1. the most successful technique for temporal credit assignment. rewards occur somewhat after the actions that lead to them— hence the phrase delayedreinforcement learning.3) (1. w) closer to the discounted value of 5 plus the immediate reward (which was 0 in this case). Learning problems similar to that faced by the agent in our grid world have been thoroughly studied by Sutton who has proposed an architecture. a) 0 0 0 0 1 0 0 0 Suppose the robot is in cell (2. called DYNA. a) 7 4 3 6 4 5 2 4 r(X.3) (1.11. the agent’s actions amount to a random walk through its space of states.3) a w n e s w n e s Q(X. 1990]. DYNA combines reinforcement learning with planning.3)—receiving no immediate reward.3) (1. although a related method.3) (1. to date. Q learning gradually reinforces those actions that contribute to positive rewards by increasing the associated Q values.75. The maximum Q value occurs for a = w. and planning is accomplished by Q learning in its model of the world. Sutton characterizes planning as learning in a simulated world that models the world that the agent inhabits.9. for solving them [Sutton. No other changes are made to the table at this episode. 1986]. Only when this random walk happens to stumble into rewarding states does Q learning begin to produce Q values that are useful. The reader might try this learning procedure on the grid world with a simple computer program.3). so the robot moves west to cell (1.
s. One way to force exploration is to perform occasional random actions (instead of that single action prescribed by the current Q values). in online learning the episodes form a continous sequence. e. including making unvisited states intrinsically rewarding and using an “interval estimate. 1993]. we represent a “goal structure” as a set of rewards to be given for achieving various conditions. could be expressed in terms of a reward that was earned whenever the agent was in that state and performed an action that transitioned back to that state in one step. This strategy would favor exploration at the beginning of learning and exploitation later. In reinforcement learning phraseology. This policy might be modiﬁed by “simulated annealing” which would gradually increase the probability of the action prescribed by the Q values more and more as time goes on. indeed. we have what is called an online learning method. this problem is referred to as the problem of exploitation (of already learned behavior) versus exploration (of possibly better behavior). the agent may ﬁxate on these and never discover a policy that leads to a possibly greater longterm reward.152 CHAPTER 11. 11. Instead of representing a goal as a condition to be achieved. . in turn. the generalized “goal” becomes maximizing discounted future reward instead of simply achieving some particular condition. we might ﬁrst ﬁnd that action prescribed by the Q values and then choose that action with probability 1/2. have been proposed for dealing with exploration.” which is related to the uncertainty in the estimate of a state’s value [Kaelbling. The example presented above included avoiding bumping into the gridworld boundary. This possibility presents an interesting way to generalize the classical notion of a “goal” in AI planning systems—even in those that do no learning. of a particular state. If online learning discovers some good paths to rewards. This distribution. also.2 Using Random Actions When the next pattern presentation in a sequence of patterns is the one caused by the agent’s own action in response to the last pattern. and choose the opposite action with probability 1/8. the convergence theorem for Q learning does not require online learning. For example. one could imagine selecting an action randomly according to a probability distribution over the actions (n. special precautions must be taken to ensure that online learning meets the conditions of the theorem. choose the two orthogonal actions with probability 3/16 each. Other methods. could depend on the Q values.5. As already mentioned. in the gridworld problem. DELAYEDREINFORCEMENT LEARNING could be modiﬁed to have several places in the grid that give positive rewards. For example. A goal of maintenance. In Watkins and Dayan’s terminology. This generalization can be made to encompass socalled goals of maintenance and goals of avoidance. Then. and w).
. X) R dot product units Figure 11. consider the simple linear machine shown in Fig. Wi Q(ai. such as backpropagation.4: A Net that Computes Q Values Such a neural net could be used by an agent that has R actions to select from. The TD(0) method for updating Q values would then have to be combined with a multilayer weightchanging rule. This kind of aggregation is an example of what has been called structural credit assignment.11. X) = X ..5... trainable weights Q(a1. given pattern inputs and actions. X) Q(a2. Various researchers have suggested mechanisms for computing Q values. Sigmoid units in the ﬁnal layer would compute Q values in the range 0 to 1.5.4. LIMITATIONS. Combining TD(λ) and backpropagation is a method for dealing with both the temporal and the structural credit assignment problems. 11. Q(aR. AND EXTENSIONS OF QLEARNING153 11. One method that suggests itself is to use a neural network. DISCUSSION. For example. X) X . .3 Generalizing Over Inputs For large problems it would be impractical to maintain a table like that used in our gridworld example. Weight adjustments are made according to a TD(0) procedure to bring the Q value for the action last selected closer to the sum of the immediate reward (if any) and the (discounted) maximum Q value for the next input pattern. The Q values (as a function of the input pattern X and the action ai ) are computed as dot products of weight vectors (one for each action) and the input vector. If the optimum Q values for the problem (whatever they might be) are more complex than those that can be computed by a linear machine. Wi . Networks of this sort are able to aggregate diﬀerent input vectors into regions for which the same action should be performed. a layered neural network might be used.
Several researchers have attempted to deal with this problem using a variety of methods including attempting to model “hidden” states by using internal memory [Lin. Exploration versus exploitation.) We have already touched on some diﬃculties. DELAYEDREINFORCEMENT LEARNING Interesting examples of delayedreinforcement training of simulated and actual robots requiring structural credit assignment have been reported by [Lin. given any action. With perceptual aliasing.5. is probably unique in terms of success on a highdimensional problem. use estimates of Q values (rather than random values) initially • use a hierarchy of actions.4 Partially Observable States So far. we have identiﬁed the input vector. That is. Because of inevitable perceptual limitations. This phenomenon has been referred to as perceptual aliasing. we can no longer guarantee that Q learning will result in even useful action policies.154 CHAPTER 11. mentioned in the last chapter. • use random actions • favor states not visited recently • separate the learning phase from the use phase • employ a teacher to guide exploration b. When such is the case. if some aspect of the environment cannot be sensed currently. When the input vector results from an agent’s perceptual apparatus (as we assume it does). we no longer have a Markov problem. 1992. 1992]. let alone optimal ones. 11. 11. X. Slow time to convergence • combine learning with prior knowledge. learn primitive actions ﬁrst and freeze the useful sequences into macros and then learn how to use the macros . there is no reason to suppose that it uniquely identiﬁes the environmental state. the next X vector. 1993].5 Scaling Problems Several diﬃculties have so far prohibited wide application of reinforcement learning to large problems.5. these and others are summarized below with references to attempts to overcome them. It might be possible to reinstate a Markov framework (over the X’s) if X includes not only current sensory precepts but information from the agent’s memory. Mahadevan & Connell. (The TDgammon program. several diﬀerent environmental states might give rise to the same input vector. perhaps it was sensed once and can be remembered by the agent. with the actual state of the environment. a. that is. may depend on a sequence of previous ones rather than just the immediately preceding one.
• use a learning method based on average rewards [Schwartz. 1993] c. . Temporal discounting problems. if the rewards change. 8. do several updates per episode [Moore & Atkeson. BIBLIOGRAPHICAL AND HISTORICAL REMARKS 155 • employ a teacher.6 Bibliographical and Historical Remarks To be added. What is learned depends on the reward structure.6. Sometimes the reinforcement learning part can be replaced by a “planner” that uses the action model to produce plans to achieve goals. 11.g. May.11. and use examples of good behavior [Lin. and then learn the “values” of states by reinforcement learning for each diﬀerent set of rewards. learning has to start over. Large state spaces • use handcoded features • use neural networks • use nearestneighbor methods [Moore. Using small γ can make the learner too greedy for present rewards and indiﬀerent to the future. Also see other articles in the special issue on reinforcement learning: Machine Learning. but using large γ slows down learning. No “transfer” of learning . 1992] • use more eﬃcient computations. • Separate the learning into two parts: learn an “action model” which predicts how actions change states (and is constant over all problems). 1990] d. 1993] e. use graded “lessons”—starting near the rewards and then backing away. 1992. e.
DELAYEDREINFORCEMENT LEARNING .156 CHAPTER 11.
Making this conclusion and saving it is an instance of deductive learning—a topic we study in this chapter. Ball(Obj1). typically the training set does not exhaust the version space. then we could logically conclude Round(Obj5). for example those using nonmonotonic reasoning. Ball(Obj2). This conclusion may be useful (if there are no facts of the form Ball(σ) ∧ ¬Round(σ)). In logic. we could say that the classiﬁer’s output does not logically follow from the training set. these methods are inductive. On the other hand. we implicitly knew φ all along. Using logical terminology. Ball(Obj3). Ball(Obj4)} A learning system that forms the conclusion (∀x)[Ball(x) ⊃ Round(x)] is inductive. logically follows from some set of facts. if we had the facts Green(Obj5) and Green(Obj5) ⊃ Round(Obj5). Yet. 157 . Under what circumstances might we say that the process of deducing φ from ∆ results in our learning φ? In a sense. Suppose that some logical proposition.1 To contrast inductive with deductive systems in a logical setting.Chapter 12 ExplanationBased Learning 12. since it was inherent in knowing ∆. a deductive system is one whose conclusions logically follow from a set of input facts. φ might not be obvious given ∆. Round(Obj2). and 1 Logical reasoning systems that are not sound.1 Deductive Learning In the learning methods studied so far. themselves might produce inductive conclusions that do not logically follow from the input facts. Round(Obj3). ∆. suppose we have a set of facts (the training set) that includes the following formulas: {Round(Obj1). but it does not logically follow from the facts. Round(Obj4). In this sense. if the system is sound. φ.
Typically. type of employment. Speedup learning simply makes it possible to make those decisions more eﬃciently. A priori information about a problem can be expressed in several ways. The learning methods are successful only if the hypothesis set is appropriate for the problem. A less direct method involves making assertions in a logical language about the property we are trying to learn. for example. the hypothesis set from which we choose functions). the more a priori information we have about the function being sought). the smaller the hypothesis set (that is. the less dependent we are on information being supplied by a training set (that is. Shouldn’t that process count as learning? Dietterich [Dietterich. that we wanted to classify people according to whether or not they were good credit risks. In EBL. etc. in that case we can learn a more general fact. A set of such assertions is usually called a “domain theory. On the surface. Strictly speaking. in principle.1.2 Domain Theories Two types of information were present in the inductive methods we have studied: the information inherent in the training samples and the information about the domain that is implied by the “bias” (for example. fewer samples). 12. we might want to save it. speedup learning does not result in a system being able to make decisions that. The methods we have studied so far restrict the hypotheses in a rather direct way. EBL can be thought of as a process in which implicit knowledge is converted into explicit knowledge.). This process is illustrated in Fig. this type of learning might make possible certain decisions that might otherwise have been infeasible. assemble such . As another example. It is called explanationbased learning (EBL). Let us further suppose that the proof we constructed did not depend on the given triangle being a right triangle.” Suppose. could not have been made before the learning took place. The learning technique that we are going to study next is related to this example.158 CHAPTER 12. suppose we are given some theorems about geometry and are asked to prove that the sum of the angles of a right triangle is 180 degrees. EXPLANATIONBASED LEARNING the deduction process to establish φ might have been arduous. Rather than have to deduce φ again. But. 1990] has called this type of learning speedup learning. there seems to be no real diﬀerence between the experiencebased hypotheses that a chess player makes about what constitutes good play and the kind of learning we have been studying so far. To take an extreme case. perhaps along with its deduction. then we generalize the explanation to produce another element of the domain theory that will be useful on similar examples. We might represent a person by a set of properties (income. 12. marital status. a chess player can be said to learn chess even though optimal play is inherent in the rules of chess. we specialize parts of a domain theory to explain a particular example. in practice. in case it is needed later.
The attributes that we use to represent a robot might include some that are relevant to this decision and some that are not. We want to ﬁnd a way to classify robots as “robust” or not. AN EXAMPLE 159 Domain Theory specialize Example (X is P) Prove: X is P Complex Proof Process Explanation (Proof) generalize A New Domain Rule: Things "like" X are P Y is like X Trivial Proof Y is P Figure 12. Or. let’s consider the following fanciful example.12. ask him or her what sorts of things s/he looks for in making a decision about a loan. we might go to a loan oﬃcer of a bank. but perhaps the application of these policies were specialized and made more eﬃcient through experience with the special cases of loans made in his or her district.3. encode this knowledge into a set of rules for an expert system. and then use the expert system to make decisions. The knowledge used by the loan oﬃcer might have originated as a set of “policies” (the domain theory). . 12.3 An Example To make our discussion more concrete.1: The EBL Process data about people who are known to be good and bad credit risks and train a classiﬁer to make decisions.
u) ⊃ Robust(u) (An individual that can ﬁx itself is robust. GR) . help to deﬁne whether or not a robot can be classiﬁed as robust..) Robot(w) ⊃ Sees(w.) We could use theoremproving methods operating on this domain theory to conclude whether certain robots are robust. but among other things. (The same domain theory may be useful for several other purposes also. let’s suppose that our domain theory includes the sentences: F ixes(u.”) In this example.) C3P O(x) ⊃ Habile(x) (C3POclass individuals are habile. We next show how such a new rule might be derived in this example.. y) ∧ Habile(x) ⊃ F ixes(x. Suppose we are given a number of facts about Num5. w) (All robots can see themselves.160 CHAPTER 12. These methods might be computationally quite expensive because extensive search may have to be performed to derive a conclusion. it describes the concept “robust.) Sees(x.. (By convention. such as: Robot(N um5) R2D2(N um5) Age(N um5. But after having found a proof for some particular robot. variables are assumed to be universally quantiﬁed..) . EXPLANATIONBASED LEARNING Suppose we have a domain theory of logical sentences that taken together. . y) (A habile individual that can see another entity can ﬁx that entity. 5) M anuf acturer(N um5.) R2D2(x) ⊃ Habile(x) (R2D2class individuals are habile. we might be able to derive some new sentence whose use allows a much faster conclusion.
12.12. 1986] call this aspect of explanationbased learning feature elimination.y) & Habile(x) => Fixes(x.2: A Proof Tree We are also told that Robust(N um5) is true. We see from this explanation that the only facts about Num5 that were used were Robot(N um5) and R2D2(N um5). This type of pruning is the ﬁrst sense in which an explanation is used to generalize the classiﬁcation problem.2 is one that a typical theoremproving system might produce. but we nevertheless attempt to ﬁnd a proof of that assertion using these facts about Num5 and the domain theory. AN EXAMPLE 161 Robust(Num5) Fixes(u. There might be little value in learning that rule since it is so speciﬁc. In this example.y) Sees(Num5. In fact. The facts about Num5 correspond to the features that we might use to represent Num5. In the language of EBL. The relevant ones are those used or needed in proving Robust(N um5) using the domain theory. Num5) Sees(x. The proof tree in Fig.3.w) Robot(Num5) R2D2(x) => Habile(x) R2D2(Num5) Figure 12. we could construct the following rule from this explanation: Robot(N um5) ∧ R2D2(N um5) ⊃ Robust(N um5) The explanation has allowed us to prune some attributes about Num5 that are irrelevant (at least for deciding Robust(N um5)). not all of them are relevant to a decision about Robust(N um5).) But the rule we extracted from the explanation applies only to Num5. u) => Robust(u) Fixes(Num5. ([DeJong & Mooney.Num5) Habile(Num5) Robot(w) => Sees(w. Can it be generalized so that it can be applied to other individuals as well? . this proof is an explanation for the fact Robust(N um5).
the example told us nothing new other than what it told us about Num5. 1972]. we replace Robot(N um5) by Robot(r) and R2D2(N um5) by R2D2(s) and redo the proof—using the explanation proof as a template.3. for example. Note that we use diﬀerent values for the two diﬀerent occurrences of N um5 at the tip nodes. the formulas in the actual data base . under certain assumptions. 1991]. This rule is the end result of EBL for this example.) In fact. Basing the generalization process on examples helps to insure that we learn rules matched to the distribution of problems that occur. We now apply the rules used in the proof in the forward direction. the proof would have been shorter. 12. 1986] call identity elimination (the precise identity of Num5 turned out to be irrelevant). 1986] and diﬀers from that of [Mitchell. We can generalize the proof by a process that replaces constants in the tip nodes of the proof tree with variables and works upward—using uniﬁcation to constrain the values of variables as needed to obtain a proof. Doing so sometimes results in more general. EXPLANATIONBASED LEARNING Examination of the proof shows that the same proof structure. (In the literature. Sees(x.4 Evaluable Predicates The domain theory includes a number of predicates other than the one occuring in the formula we are trying to prove and other than those that might customarily be used to describe an individual.. In this example. (Note that we always substitute terms that are already in the tree for variables in rules. et al. The substitutions are then applied to the variables in the tip nodes and the root node to yield the general rule: Robot(r) ∧ R2D2(r) ⊃ Robust(r). keeping track of the substitutions imposed by the most general uniﬁers used in the proof. It is also similar to that used in [Fikes. In the latter application. y). The process by which N um5 in this example was generalized to a variable is what [DeJong & Mooney. that if we used Habile(N um5) to describe Num5.) Clearly. Note that the occurrence of Sees(r. could be used independently of whether we are talking about Num5 or some other individual. (The generalization process described in this example is based on that of [DeJong & Mooney.) This process results in the generalized proof tree shown in Fig. 1986]. One might note. The sole role of the example in this instance of EBL was to provide a template for a proof to help guide the generalization process. y) ∧ Habile(y) ⊃ F ixes(x.162 CHAPTER 12. It is important to note that we could have derived the general rule from the domain theory without using the example. Why didn’t we? The situation is analogous to that of using a data base augmented by logical rules. r) as a node in the tree forces the uniﬁcation of x with y in the domain rule. et al. using the same sentences from the domain theory. There are a number of qualiﬁcations and elaborations about EBL that need to be mentioned. this general rule is more easily used to conclude Robust about an individual than the original proof process was. 12.. doing so is called static analysis [Etzioni. but nevertheless valid rules.
y) & Habile(x) => Fixes(x. Perhaps another analogy might be to neural networks.” and those in the logical rules are “intensional.w) Robot(r) R2D2(s) becomes R2D2(r) after applying {r/s} Figure 12.4. EVALUABLE PREDICATES 163 Robust(r) Fixes(u. such formulas satisfy what is called the operationality criterion. r/s} Sees(r. The EBL process assumes something similar. Finding the new rule corresponds to ﬁnding a simpler expression for the formula to be proved in terms only of the evaluable predicates.y) {r/x. The logical rules serve to connect the data base predicates with higher level abstractions that are described (if not deﬁned) by the rules.” This usage reﬂects the fact that the predicates in the data base part are deﬁned by their extension—we explicitly list all the tuples sastisfying a relation. The domain theory is useful for connecting formulas that we might want to prove with those whose truth values can be “looked up” or otherwise evaluated. they have to be derived using the rules and the database.r) Habile(s) {s/x} R2D2(x) => Habile(x) {r/w} Robot(w) => Sees(w. We typically cannot look up the truth values of formulas containing these intensional predicates.12. r) Sees(x. . the predicates in the domain theory correspond to the hidden units. u) => Robust(u) {r/u} Fixes(r. The evaluable predicates correspond to the components of the input pattern vector. r/y.3: A Generalized Proof Tree are “extensional. In the EBL literature.
and pushing it . 12. 12. usually with positive utility. We show an example of a process for creating macrooperators based on techniques explored by [Fikes. 1990] for an analysis). . 12. Such a rule might have resulted if we were given {C3P O(N um6). Bionic(u) such that the domain theory also contained R2D2(x) ⊃ Bionic(x) and C3P O(x) ⊃ Bionic(x). In our example. . Adding a new rule decreases the depth of the shortest proof but it also increases the number of formulas in the domain theory. Referring to Fig. 12. (See [Minton.6 Utility of EBL It is well known in theorem proving that the complexity of ﬁnding a proof depends both on the number of formulas in the domain theory and on the depth of the shortest proof.. say.5 More General Proofs Examining the domain theory for our example reveals that an alternative rule might have been: Robot(u) ∧ C3P O(u) ⊃ Robust(u). Thus. we might be willing to induce the formula Bionic(u) ⊃ [C3P O(u) ∨ R2D2(u)] in which case the rule with the disjunction could be replaced with Robot(u) ∧ Bionic(u) ⊃ Robust(u). Adding disjunctions for every alternative proof can soon become cumbersome and destroy any eﬃciency advantage of EBL. EXPLANATIONBASED LEARNING 12. the eﬃciency might be retrieved if there were another evaluable predicate. In realistic applications. 1972]. the question arises. the added rules will be relevant for some tasks and not for others. Robot(N um6). do we want to generalize the two rules to something like: Robot(u) ∧ [C3P O(u) ∨ R2D2(u)] ⊃ Robust(u)? Doing so is an example of what [DeJong & Mooney.7 Applications There have been several applications of EBL methods.1 MacroOperators in Planning In automatic planning systems.164 CHAPTER 12. namely the formation of macrooperators in automatic plan generation and learning how to control search.} and proved Robust(N um6). 1986] call structural generalization (via disjunctive augmentation ).7. et al. . eﬃciency can sometimes be enhanced by chaining together a sequence of operators into macrooperators. consider the problem of ﬁnding a plan for a robot in room R1 to fetch a box. EBL methods have been applied in several settings. B1. After seeing a number of similar examples. it is unclear whether the overall utility of the new rules will turn out to be positive. After considering these two examples (Num5 and Num6). We mention two here. R2. by going to an adjacent room.4.
r1) Delete list: IN ROOM (ROBOT. Figure 12.R2) CONNECTS(D1..4: Initial State of a Robot Problem We will construct the plan from a set of STRIPS operators that include: GOTHRU(d. APPLICATIONS 165 back to R1. are: IN ROOM (ROBOT. R1) . true in the initial state. r1). r2). r2) A backwardreasoning STRIPS system might produce the plan shown in Fig. and the facts that are true in the initial state are listed in the ﬁgure.) The preconditions for this plan. We show there the main goal and the subgoals along a solution path. r2) Delete list: IN ROOM (ROBOT. r2) Preconditions: IN ROOM (ROBOT. The goal for the robot is IN ROOM (B1.7.R1) . IN ROOM (b.R2. r1.5. r1) Add list: IN ROOM (ROBOT. r1. r1. r1) Add list: IN ROOM (ROBOT. CON N ECT S(d. (The conditions in each subgoal that are true in the initial state are shown underlined. r1).12. r2) Preconditions: IN ROOM (ROBOT. R1 D1 B1 D2 R3 R2 Initial State: INROOM(ROBOT. R1) INROOM(B1. r1). d. R1). r2). IN ROOM (b. r2) PUSHTHRU(b..R1. 12. IN ROOM (b. r1. CON N ECT S(d.R2) CONNECTS(D1.
INROOM(B1. The preconditions for the generalized plan are: IN ROOM (ROBOT. R1).166 CHAPTER 12. INROOM(B1. We then follow the structure of the speciﬁc plan to produce the generalized plan shown in Fig. We ﬁrst generalize these preconditions by substituting variables for constants. R2). CONNECTS(d. R1) IN ROOM (B1. EXPLANATIONBASED LEARNING CON N ECT S(D1. r4) INROOM(B1. R2) Saving this speciﬁc plan. R1). r4).6 that achieves IN ROOM (b1. R2).5: A Plan for the Robot Problem Another related technique that chains together sequences of operators to form more general ones is the chunking mechanism in Soar [Laird. r2. INROOM(B1. R2) Figure 12.R1) R1 D1 R2 INROOM(ROBOT.R1) R3 {R1/r3. {R2/r1. R1). would not be as useful as would be saving a more general one. r3.d. r1) CON N ECT S(d1. R2) CON N ECT S(D1. R1.r1. CONNECTS(d2. R1).R1) PUSHTHRU(B1. 12. R1. r1. r2) CON N ECT S(d2. R2. et al. r4) IN ROOM (b. R2) D1/d} GOTHRU(d2. R2) B1 D2 PLAN: GOTHRU(D1. R2. CONNECTS(D1. r3).R2. r1) INROOM(ROBOT. R2) INROOM(ROBOT. CONNECTS(D1.. R1). . valid only for the speciﬁc constants it mentions. CONNECTS(D1. R2. r3. CONNECTS(D1. Note that the generalized plan does not require pushing the box back to the place where the robot started. R2. r1). 1986].R1. INROOM(B1. r1. D1/d2} INROOM(ROBOT.R2) PUSHTHRU(B1. R2).D1.
it analyzes its successful and its unsuccessful choices and attempts to explain them in terms of its domain theory. r4).12.7.r4) INROOM(ROBOT. r1). Minton proposed using EBL to learn eﬀective ways to control search [Minton. etc. CONNECTS(d2. either succeeds or fails. r1. After producing a plan. CONNECTS(d2. r2) INROOM(ROBOT. and in jobshop scheduling. r4) GOTHRU(d1.6: A Generalized Plan 12. 1988]. EBL methods can be used to improve the eﬃciency of planning in another way also. INROOM(b1. r2. r4). Using an EBLlike process. CONNECTS(d1. an operator to apply. PRODIGY has a domain theory involving both the domain of the problem and a simple (meta) theory about planning. it is able to produce useful control rules such as: .7. INROOM(b1. Its meta theory includes statements about whether a control choice about a subgoal to work on.r2.d2. r1. r2. In his system called PRODIGY.2 Learning Search Control Knowledge Besides their use in creating macrooperators. r1. r2). CONNECTS(d1. r2). r4) Figure 12. APPLICATIONS 167 INROOM(b1.r4) PUSHTHRU(b1. in a simple mobile robot world. r2). PRODIGY is a STRIPSlike system that solves planning problems in the blocksworld.
168 CHAPTER 12. EXPLANATIONBASED LEARNING IF (AND (CURRENT − NODE node) (CANDIDATE − GOAL node (ON x y)) (CANDIDATE − GOAL node (ON y z))) THEN (PREFER GOAL (ON y z) TO (ON x y)) PRODIGY keeps statistics on how often these learned rules are used. thus measured. 1990] has shown that there is an overall advantage of using these rules (as against not having any rules and as against handcoded search control rules). is judged to be high. 12. their savings (in time to ﬁnd plans). Minton [Minton. Bibliographical and Historical Remarks .8 To be added. It saves only the rules whose utility. and their cost of application.
NJ: Erlbaum.. 1992. et al. 1991. A. An Introduction to Multivariate Statistical Analysis. on Innovative Applications of Artiﬁcial Intelligence. 151160.” Info. Menlo Park.” Machine Learning. [Blumer.” Proc. and Bower.. 1994] Barto. [Bellman.). 1. S. “Learning to Act Using RealTime Dynamic Programming. [Anderson. Lett. “SMART: Support Management Automated Reasoning Technology for COMPAQ Customer Service. S.. et al. E.. 1958] Anderson... 1957.. Human Associative Memory. “InstanceBased Learning Algorithms. and Walden. vol 24. A. S.. J. and Singh. T. Princeton: Princeton University Press. “Occam’s Razor. J. [Baum. D. and Albert.. Drastal. [Aha.. et al. and Haussler.. & Singh. [Barto.... “Learnability and the VapnikChervonenkis Dimension. Cambridge. M. T. 1988. N. 1973. [Baum & Haussler. Computational Learning Theory and Natural Learning Systems. S. 1973] Anderson. pp. E. A. MA: MIT Press. Process. 1989] Baum. 1994] Baum. 1957] Bellman. et al.. MA: AddisonWesley. Reading. 1991] Aha. [Anderson & Bower. Computer Control of Machines and Processes. 3766. 1994. Kibler... R. (eds.. W.. pp. 1989. 1992] Acorn. [Blumer.” JACM. Bradtke. 415442. Volume 1: Constraints and Prospects. 1988] Bollinger. 169 . [Bollinger & Duﬃe. G. New York: John Wiley. CA: AAAI Press. Fourth Annual Conf. H. Dynamic Programming. Bradtke.. Hillsdale. pp. R. R.. D. 1994. “What Size Net Gives Valid Generalization?” Neural Computation. G.” to appear in Artiﬁcial Intelligence.Bibliography [Acorn & Walden. 1990] Blumer. 1958. and Rivest. E. 1987] Blumer... 1987... 6. D. “When Are kNearest Neighbor and Backpropagation Accurate for FeasibleSized Sets of Examples?” in Hanson.. 37780. 1990. and Duﬃe.
R. Michalski.. Classiﬁcation and Regression Trees. (eds. June 1962 and September 1962. A. J. “Nearest Neighbor Pattern Classiﬁcation. 295301. and Dietterich. 1967] Cover. 1988. and Mooney. G... and Hart.... Stanford University. Workshop on Machine Learning. 9 (pp.170 BIBLIOGRAPHY [Brain.. Applied Optimal Control. T. P. on Information Theory. San Francisco: Morgan Kaufmann. 1:145176. [Dasarathy. [Brent. “AutoClass: A Bayesian Classiﬁcation System. C. (eds. et al. 1992. 1990] Brent. B. V. “ExplanationBased Learning: An Alternative View. “Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. Elec. New York: Blaisdell. 1962] Brain. 913) and No. 1967. J. Morgan Kaufmann. J. Readings in Machine Learning. Menlo Park. 1984.. and Ho. 1990. 1965... 1965] Cover. 1983] Carbonell. Olshen. T. pp 452467.. CA. Y. 1994. B. T. “Graphical Data Processing Research Study and Experimental Investigation. [Cover & Hart. [Dayan & Sejnowski. 1993. pp.” Machine Learning. 8. Computer Science Department. San Mateo.. 8 (pp. IEEE Computer Society Press. R. [Cheeseman. Stanford. 1992] Dayan. T. [Buchanan & Wilkins. 296306.” IEEE Trans. and Mitchell.. Nearest Neighbor Pattern Classiﬁcation Techniques.). Carbonell.” IEEE Trans. 14. 1983. Contract DA 36039 SC78343. 1986] DeJong... 13. SRI International.” Machine Learning. [Dayan. CA: Wadsworth. 1991.. E. et al. . [Breiman. et al. R. Readings in Knowledge Acquisition and Learning. J. 2127. pp.” Report No. P. and Dietterich. and Stone. [Carbonell.C. Readings in Machine Learning.” Proc.. Monterey. A. Comp. T. 1988] Cheeseman.” Machine Learning.” in Machine Learning: An Artiﬁcial Intelligence Approach. 326334. D.. 1991] Dasarathy. San Francisco: Morgan Kaufmann.. “T D(λ) Converges with Probability 1. et al. Reprinted in Shavlik. 1984] Breiman. 1990.). Reprinted in Shavlik. March 1990. Morgan Kaufmann.. et al. R..” Numerical Analysis Project Manuscript NA9003. “The Convergence of TD(λ) for General λ. P. P. Fifth Intl.. 310). J. “Fast Training Algorithms for MultiLayer Neural Nets. CA.. 341362. and Wilkins. T. 1993] Buchanan. [DeJong & Mooney. and Sejnowski. L. P. San Francisco: Morgan Kaufmann... [Cover. San Francisco.. Friedman.. EC14... CA 94305.. 1986. 1994] Dayan. [Bryson & Ho 1969] Bryson. “Learning by Analogy. June..
E. pp. [Duda... Ninth Nat. D. P..” Proc. MIT Press. and Lebiere. Vanderbilt University. on Elect. 1991] Etzioni. 1982] Efron. New York: Wiley. H. pp. [Efron.” in Touretzky. R.). “A General Lower Bound on the Number of Examples Needed for Learning. [Etzioni. April 1966. O. T. 1992] Evans.. G. T. et al. 1988 Workshop on Computational Learning Theory. 524532. . T.. 1990] Fahlman.. EC15. Comput. and Fisher. 1990. 1988. and Bakiri. [Dietterich. pp. Conf. R. Palo Alto: Annual Reviews Inc. 1982. March. Mach. 1991] Dietterich. et al.. 4:255306. “ErrorCorrecting Output Codes: A General Method for Improving Multiclass Inductive Learning Programs. 60:1. 1993] Etzioni. Menlo Park: AAAI Press. “Machine Learning. 1990] Dietterich. The Jackknife. 1973] Duda. Menlo Park.. 1990. pp.” IEEE Trans.. 1966.. D... [Duda & Hart. O.BIBLIOGRAPHY 171 [Dietterich & Bakiri. 1966] Duda.. “The CascadeCorrelation Learning Architecture. G. B.. on A. B. 1973. 1990] Dietterich. Advances in Neural Information Processing Systems. 1966] Duda. 533540. Rev. G. 2431. [Ehrenfeucht. “A Structural Theory of ExplanationBased Learning.. Report CS9206. Computers. Porter. the Bootstrap and Other Resampling Plans. of Ninth National Conf. April. Seventh Intl. O.” Artiﬁcial Intelligence. Report prepared for ONR under Contract 3438(00).. pp. “A Comparative Study of ID3 and Backpropagation for English TexttoSpeech Mapping. vol. San Francisco: Morgan Kaufmann. 110120. and Mooney..” Proc.” in Proc. R. 2.” Proc. 220232. R. [Evans & Fisher. Learning. 1990. 572577.. [Duda & Fossum. et al. pp.. S. pp. Philadelphia: SIAM. Hild. 1991. “STATIC: A ProblemSpace Compiler for PRODIGY. San Francisco: Morgan Kaufmann. 1991. and Hart. (ed. 1993. Process Delay Analyses Using DecisionTree Induction. 1988] Ehrenfeucht. AAAI91.).. and Bakiri. Conf. 93139. Department of Computer Science. [Dietterich. B. TN. O. SRI International..” SRI Tech.. Pattern Classiﬁcation and Scene Analysis. A. Sci. CA. H. on Artiﬁcial Intelligence.. O.I. [Etzioni. and Fossum. San Francisco: Morgan Kaufmann.. “Training a Linear Machine on Mislabeled Patterns.” Annu. Tech.. “Pattern Classiﬁcation by Iteratively Determined Linear and Piecewise Linear Discriminant Functions. (eds. 1992. C. [Fahlman & Lebiere..
. 849852. 1987. U. San Francisco: Morgan Kaufmann. 96107.. “Learning and Executing Generalized Robot Plans... 1994. 1994] Fu... M. J. “The Simulation of Verbal Learning Behavior. U.” IEEE Spectrum. T. Weir. March. Cambridge: The MIT Press. U. J. and Nilsson. Djorgovski. 1987] Genesereth. J. 1977] Friedman. D. 1989. pp. et al. Software. New York: McGrawHill. on Pattern Recognition.” in Proc. I..” Artiﬁcial Intelligence. J. 1987] Fisher. “Knowledge Acquisition via Incremental Conceptual Clustering. L. A. 1988.” Artiﬁcial Intelligence. R. June 1993. 1990. San Francisco: Morgan Kaufmann. San Francisco: Morgan Kaufmann. Conf. Neural Networks in Artiﬁcial Intelligence. [Gluck & Rumelhart.. D. Hart.. pp. S. Bentley. 1989] Gluck. 1972. 2632. 1987. “Optimal Linear Discriminants.. D. 1990. Logical Foundations of Artiﬁcial Intelligence..” in Fayyad.. and Dietterich. pp.(eds. R. on Machine Learning. S. P. on Math.. and Nilsson.). pp 468486.. et al. 1996. [Genesereth & Nilsson. and Weir. N.. pp. T. pp. 1988] Haussler. 3(3):209226. et al. . N. J.. [Hammerstrom. [Fu... pp. [Gallant. 1986..” in Eighth International Conf. 112119. G. “SKICAT: A Machine Learning System for Automated Cataloging of Large Scale Sky Surveys. September 1977. San Francisco: Morgan Kaufmann..) [Feigenbaum. [Friedman. Reprinted in Shavlik.. 1990.” ACM Trans. [Fikes.” Machine Learning. 1993] Fayyad. and Dietterich. N. M. Readings in Machine Learning. E. [Fisher. 36:177221. pp 251288. 1986] Gallant. L. “Automating the Analysis and Cataloging of Sky Surveys. and Dietterich. “Quantifying Inductive Bias: AI Learning Algorithms and Valiant’s Learning Framework... Hillsdale. [Haussler. H. and Djorgovski. 267–283. Tenth Intl.. Advances in Knowledge Discovery and Data Mining. “Neural Networks at Work. T. 2:139172. 1961] Feigenbaum. Reprinted in Shavlik. 471ﬀ. and Rumelhart. New York: IEEE. D..172 BIBLIOGRAPHY [Fayyad.. Reprinted in Shavlik. (For a longer version of this paper see: Fayyad. NJ: Erlbaum Associates. Readings in Machine Learning. 1961. Readings in Machine Learning.” Proceedings of the Western Joint Computer Conference. M. The Developments in Connectionist Theory. 1993. A. 1972] Fikes. Chapter 19.. “An Algorithm for Finding Best Matches in Logarithmic Expected Time. 19:121132.. San Francisco: Morgan Kaufmann. N. 1993] Hammerstrom. Neuroscience and Connectionist Theory. and Finkel. et al.
Cambridge. chapter 20.. and Palmer. D. J. Cambridge. 1987. 1987] Jabbour. MA: MIT Press.. 11011108.. and Mitchell.. Experiments in Induction. The Possibilities of GeneralPurpose Learning Algorithms Applied to Parallel RuleBased Systems. H. 1995] John. Learning in Embedded Systems. [Hirsh. Studies in the Sciences of Complexity. Cambridge. A. 545. [Hertz.” Proc. R. 1949. CaseBased Reasoning. H. on Artiﬁcial Intelligence and Statistics. P. Adaptation in Natural and Artiﬁcial Systems. “ALFA: Automated Load Forecasting Assistant. [Hunt. 1991] Hertz. 1986. Lecture Notes.” In Michalski.” Proc. 1966] Hunt. San Francisco: Morgan Kaufmann. Introduction to the Theory of Neural Computation.) [Holland. Machine Learning: An Artiﬁcial Intelligence Approach. MA: MIT Press... 1949] Hebb. CA. & Stone. Marin. Krogh. 1994. Cambridge. R... 1994] Koza. O. Eighth Nat. [Koza.. 1966. 1975] Holland. FL.. “Generalizing Version Spaces. (eds.. Carbonell. Ann Arbor: The University of Michigan Press. R.) .. “Robust Linear Discriminant Trees.” Machine Learning. J. Conf. San Francisco. L.. of European Conference on Machine Learning (ECML94). [Jabbour. J. The Organization of Behaviour. E. J. on AI. 1. J. “Escaping Brittleness. 1994.. “Probably Approximately Correct Learning.. Santa Fe Inst. January. 1990. Volume 2. New York: AddisonWesley. of the IEEE Pwer Engineering Society Summer Meeting. 1975. 1992] Koza.” Proc. J. [Kohavi. [Hebb. MA: MIT Press. 1992. & Palmer. Genetic Programming II: Automatic Discovery of Reusable Programs. [Holland. and Stone. New York: Academic Press. Krogh. [Kaelbling. 1993] Kaelbling. [Kolodner. New York: John Wiley. [John. Marin. 1993. J. pp. Genetic Programming: On the Programming of Computers by Means of Natural Selection. 1990] Haussler. J. [Koza.. 1991. 1994] Hirsh. 17.. Ft... P. 1995.. San Francisco: Morgan Kaufmann... G. 1994] Kohavi. . Cambridge. T.” Proc. K. 1986] Holland.BIBLIOGRAPHY 173 [Haussler. D. “BottomUp Induction of Oblivious ReadOnce Decision Graphs. Lauderdale. MA: MIT Press. of the Conf. 1993. et al. et al. (Second edition printed in 1992 by MIT Press. MA. 1994. 1993] Kolodner. vol. K.
1943] McCulloch. 311365. 1986] Laird. pp.” Machine Learning.. 293321. P. [Langley. “A Logical Calculus of the Ideas Immanent in Nervous Activity. 5. Reprinted in Buchanan.. Symp.” Network. D... “How Fast Can a Thresha a old Gate Learn?. W. 1988] Littlestone. 1988. (eds.. G. 4:6785. England: Ellis Horwood. Scotland.” in Hanson. 1993. 381414.). N. N. “Learning Quickly When Irrelevant Attributes Abound: A New LinearThreshold Algorithm. S. Morgan Kaufmann. 1994. [Lin. 182189. [Marchand & Golea.” Machine Learning. 1993. Boston. “Scaling Up Reinforcement Learning for Robot Control.” Machine Learning 2: 285318. D. R. Kluwer Academic Publishers. [Mahadevan & Connell.. W. “Some Directions in Machine Intelligence.. pp. 1993] Marchand. Tenth Intl. and Newell..” Artiﬁcial Intelligence. 1993] Lin. Readings in Knowledge Acquisition and Learning. G. 1986.174 BIBLIOGRAPHY [Laird. “Automatic Programming of BehaviorBased Robots Using Reinforcement Learning... pp. 1996] Langley.. Chichester. “SelfImproving Reactive Agents Based on Reinforcement Learning. 1992. Vol. Cambridge. and Rivest. and Teaching. “Areas of Application for Machine Learning. [Lin.. et al.. [Minton.. 1988. Planning. (eds. Volume 1: Constraints and Prospects. and Tur´n. pp. H. The Turing Institute. 8. on Machine Learning.” Proc. Elements of Machine Learning.. J. [Langley. S. and Connell. “Chunking in Soar: The Anatomy of a General Learning Mechanism.. P. “On Learning Simple Neural Concepts: From Halfspace Intersections to Neural Decision Lists.” unpublished manuscript. Learning Search Control Knowledge: An ExplanationBased Approach. San Francisco. pp. 1994. pp. San Francisco: Morgan Kaufmann. B. 1992. 1994] Lavraˇ. MA: MIT Press. 1943. 115133. 1146.. 1996. Inductive Logic Proc z c z gramming. S. and Wilkins. 1994] Maass. CA.” Bulletin of Mathematical Biophysics. A. S. M. J.. San Francisco: Morgan Kaufmann. [McCulloch & Pitts... Glasgow. and Golea.). 518535. Rosenbloom... of Fifth Int’l. W. 1992] Mahadevan. . 1988] Minton. [Littlestone. 1992] Langley. Chicago: University of Chicago Press. 55. P. MA. 1. 1992. L. Conf. on Knowledge Engineering. [Lavraˇ & Dˇeroski. and Pitts. Sevilla. Drastal. S. and Dˇeroski. L. Computational Learning Theory and Natural Learning Systems.” Proc. 1992. 1992] Lin.. M... [Maass & Tur´n. [Michie. 1993. 1992] Michie.
” in Hanson. [Mueller & Page. A. 209. P. New York: Wiley. San Francisco: Morgan Kaufmann. J.. 1988. Symbolic Computing with Lisp and Prolog. 435451.. pp. 1991] Muggleton.” Machine Learning. A. S.. San Francisco: Morgan Kaufmann. pp. T.. and Dietterich. pp. [Muroga. 1991. 96–107. 103130... and Function Approximators. [Mitchell. T.. Reprinted in Shavlik.). 8. S. R. October. New York: John Wiley & Sons. et al. 1982. San Francisco: Morgan Kaufmann. Eﬃcient Memorybased Learning for Robot Control. A. et al. and Atkeson. 573587. A. W. Reprinted in Shavlik.” Machine Learning. T. R. and Petsche. 1992] Muggleton. Hill.. PhD. Readings in Machine Learning.. Computer Laboratory. T.. 1986. pp. et al.. Judd. and Johnson. and Dietterich.. Thesis. 18:203226. J... San Francisco: Morgan Kaufmann. 1990. Reprinted in Shavlik. S. [Moore. [Moore. 1990. 1991] Natarajan.... (eds. [Muggleton. S. 1991.. Smoothers. B. Cambridge: MIT Press.. San Francisco: Morgan Kaufmann.. 1993. 1994. [Natarjan.. “An Empirical Investigation of Brute Force to Choose Features. London: Academic Press. 1990. 1971] Muroga. 1971. Readings in Machine Learning. 1990. Threshold Logic and its Applications. and Dietterich. T. 1:1. C. “ExplanationBased Generalization: A Unifying View. 1990. “Quantitative Results Concerning the Utility of ExplanationBased Learning. M. Readings in Machine Learning.BIBLIOGRAPHY 175 [Minton. S..).. 3. 1990] Minton. 1992. J. R. J. 1992. pp. 1986] Mitchell. Machine Learning: A Theoretical Approach. .” New Generation Computing. University of Cambridge.” in Moody. “Generalization as Search. Advances in Neural Information Processing Systems 4.. [Mitchell. S. 295318. S. T. “Inductive Logic Programming. Inductive Logic Programming. pp. Computational Learning Theory and Natural Learning Systems. 13.” Artiﬁcial Intelligence. 1988] Mueller. (eds. J. [Muggleton. and Lippman. and Page.” Artiﬁcial Intelligence. Hanson. [Moore. “Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time.. “Fast. Vol. Technical Report No. D. 1982] Mitchell.. 1992] Moore. 1994] Moore. 1990] Moore. 42. [Moore & Atkeson. 1993] Moore.. Robust Adaptive Control by Learning Only Forward Models. 363392.
D. 1990] Quinlan. N. “Boolean Feature Discovery in Empirical Learning. San Francisco: Morgan Kaufmann.” Machine Learning. 1992 Australian Artiﬁcial Intelligence Conference.. “The Utility of Knowledge in Inductive Learning. 1990.. M.. “Learning Logical Deﬁnitions from Relations.” Proc. “Inferring Decision Graphs using the Minimum Message Length Principle. New York: McGrawHill. [Pagallo & Haussler. San Francisco: MorganKaufmann. [Nilsson. Ross.) [Oliver. J.. and Dietterich. 1986] Quinlan. J. Ross. 3. 1989..1. “Induction of Decision Trees.” Machine Learning. [Peterson. J. 5794. March.. Final Report on Contract AF30(602)3448. 5. Reprinted in Shavlik. RADCTR65257. vol. 1961. 239266. C. 1986... Error Correcting Codes. (eds. [Pomerleau. March 1990.” Information and Computation. San Francisco: Morgan Kaufmann. 1965. Griﬃss Air Force Base. [Quinlan. 1992] Oliver. G. 1991. 1990. “Generating Production Rules from Decision Trees. & Wallace. (This book is a reprint of Learning Machines: Foundations of Trainable PatternClassifying Systems. and Haussler..” In IJCAI87: Proceedings of the Tenth Intl. on Artiﬁcial Intelligence. 1992] Pazzani. N. 57–69. 1991] Pomerleau. 429435. September. T. Neural Network Perception for Mobile Robot Guidance. J. [Quinlan. 1993] Pomerleau.. [Quinlan & Rivest. J. Rome Air Development Center (Now Rome Laboratories). 7199. pp. Advances in Neural Information Processing Systems. 1965] Nilsson. “Inferring Decision Trees Using the Minimum Description Length Principle. The Mathematical Foundations of Learning Machines. “Rapidly Adapting Artiﬁcial Neural Networks for Autonomous Navigation. D. 1993. 1990] Pagallo. New York: John Wiley. 1990] Nilsson. D. 1:81–106. Readings in Machine Learning. pp. 80:227–248.. no. New York. Dowe. Boston: Kluwer Academic Publishers. Dowe. D. W. 1989] Quinlan. 1987] Quinlan. and Rivest. 3047.. “Theoretical and Experimental Investigations in Trainable PatternClassifying Systems. [Pazzani & Kibler. Report No. [Pomerleau. P. D. 1992. .” Machine Learning. R. 1990. 1961] Peterson. 1965.176 BIBLIOGRAPHY [Nilsson.” Tech. and Wallace. pp. J. pp. Ron... et al. 1987. [Quinlan. R.” in Lippmann. and Kibler. J. San Francisco: Morgan Kaufmann. J.” Machine Learning. 1992.).5. Joint Conf. 9.
E.. C4.” In Rumelhart. R. Introduction to Stochastic Dynamic Programming. pp.. G. [Shavlik & Dietterich. R. Drastal. [Shavlik. and Williams. 14:465471. 1988. J. and Norvig. L. 1990] Shavlik. T. P. . Englewood Cliﬀs. W. San Francisco: Morgan Kaufmann. and Churchland.. Readings in Machine Learning. & Churchland. “A Reinforcement Learning Method for Maximizing Undiscounted Rewards.. Koch. 318–362. J. A. April 1962. S.. 1986] Rumelhart.” Science. 1959] Samuel. Ross. [Quinlan. pp. J. G... R. and Rivest. Hinton. [Russell & Norvig 1995] Russell. [Rissanen... S.. C. Stanford Electronics Labs.” Automatica. and Dietterich. J.5: Programs for Machine Learning. 1993.” IBM Journal of Research and Development. 298305. 229246. 1988] Sejnowski. 15561.. [Sejnowski.” Proc. D. Computational Learning Theory and Natural Learning Systems. 6. Washington: Spartan Books.. Hinton. and McClelland. J.. 1990. “Learning Internal Representations by Error Propagation. [Rumelhart. Artiﬁcial Intelligence: A Modern Approach. Stanford. R...) Parallel Distributed Processing. 1987] Rivest. 1991. “Symbolic and Neural Learning Algorithms: An Experimental Comparison. 2. [Ridgway. and Towell.). G. R. T. on Machine Learning. pp. San Francisco: Morgan Kaufmann.BIBLIOGRAPHY 177 [Quinlan. J. NJ: Prentice Hall. MA: MIT Press. 111143. Mooney. 1995. [Ross. E. 1993] Schwartz. Cambridge. P. 1978. “Comparing Connectionist and Symbolic Learning Methods. “Some Studies in Machine Learning Using the Game of Checkers. S. Rep. [Rivest. 1958] Rosenblatt. 1993. 445456. 1983] Ross. Mooney.. C.. CA. Volume 1: Constraints and Prospects. PhD thesis. 3:211229. “Learning Decision Lists. 1994. (eds. E. “Computational Neuroscience.. July 1959.” in Hanson. Tech. 1983. 1962] Ridgway. F. 241: 12991306.” Machine Learning.. D... 1991] Shavlik. A.. An Adaptive Logic System with Generalizing Properties. 1978] Rissanen. Tenth Intl... J. Conf. & Towell. 1961. Principles of Neurodynamics. 1993] Quinlan. [Rosenblatt.. Koch.. Vol 1. & Williams. New York: Academic Press. L. San Francisco: Morgan Kaufmann. “Modeling by Shortest Data Description. (eds. [Samuel. 1986. [Schwartz.. 1987. 1994] Quinlan.” Machine Learning.
Shavlik. & Noordweier... G. Conf. R. J. ... Vol. Michie.. Englewood Cliﬀs. NJ: PrenticeHall.. “On the Uniform Convergence of Relative Frequencies. and Chervonenkis.. V. 19891994] Advances in Neural Information Processing Systems..” Proc.. 1992] Watkins. G. [Valiant.” Machine Learning.” in Proceedings of the Ninth Annual Conference of the Cognitive Science Society. S. [Various Editors..” Proc. 216224. San Francisco: Morgan Kaufmann. Shavlik. 16.. C. “Learning to Predict by the Methods of Temporal Diﬀerences. [Sutton. 1989.. [Watkins & Dayan. 1989] Unger. 1971] Vapnik. H. San Francisco: Morgan Kaufmann. Nov. 1984. San Francisco: Morgan Kaufmann. 1990. “A Theory of the Learnable. C. and Noordweier. 1987. A. 1990. 1989] Utgoﬀ.). Paramount Publishing International. 257277. 1988] Sutton. on Artiﬁcial Intelligence. Conf.. D.. “Incremental Induction of Decision Trees. 1987] Sutton. pp. C. [Towell. 1992. A. Vol. G. (eds. 1992. [Tesauro. [Vapnik & Chervonenkis. of the Seventh Intl. 3/4. 1992] Tesauro. Michie. nos. and Reacting Based on Approximating Dynamic Programming. 1990] Sutton. The Essence of Logic Circuits. “Technical Note: QLearning. M. 2. vols 1 through 6. 977984. 1992.” Machine Learning. R.178 BIBLIOGRAPHY [Sutton & Barto. 1992] Towell G. J. 264280. & Spiegalhalter. D. and Spiegalhalter. [Unger. 1984] Valiant.” Machine Learning 3: 944.” Machine Learning. S. Machine Learning. Eighth Natl. Advances in Neural Information Processing Systems. R... Planning. 11341142. 27. “A TemporalDiﬀerence Model of Classical Conditioning. “Integrated Architectures for Learning. 4:161–186... pp. J. pp. Theory of Probability and its Applications.. 861866.. 1994] Taylor. 1989. 1990] Towell. 1988. L. 1971. and Dayan. 8. R. No. pp.” in Moody. pp. “Interpretation of Artiﬁcial Neural Networks: Mapping KnowledgeBased Neural Networks into Rules.” Communications of the ACM. and Barto. “Practical Issues in Temporal Diﬀerence Learning. and Lippmann. Neural and Statistical Classiﬁcation.. P. [Towell & Shavlik. 8. NJ: Erlbaum. J. and Shavlik. [Sutton. Hanson.. [Utgoﬀ.. P. S. 4. “Reﬁnement of Approximate Domain Theories by KnowledgeBased Artiﬁcial Neural Networks. on Machine Learning. [Taylor. 1989 1994.. pp.. 279292. Hillsdale. S.
. J.” Proc. pp. et al. R. [Weiss & Kulikowski. S. pp. 1985] Widrow. 1974] Werbos. 428437. and Kulikowski. 1962] Widrow. 1991] Weiss.. Princeton University. et al.). Fifth Intl. 1990. paper CP601261. San Francisco: Morgan Kaufmann. on Methodologies for Intelligent Systems.” in Yovits. J. 1990] Wnek. Princeton.. 14151442. 1989] Watkins. R. England. no. Jacobi.. B.. Conf.. IEEE.. [Winder.. 435461. “Generalization and Storage in Networks of Adaline Neurons. C. 1990. 78.. M. B. DC: Spartan Books. Englewood Cliﬀs. H. of the AIEE Symp. and Lehr. Threshold Logic. University of Illinois at UrbanaChampaign.D. on Switching Circuits and Logical Design. C. Thesis. and Stearns. [Widrow & Lehr. [Wnek.BIBLIOGRAPHY 179 [Watkins.. 9. B... PhD Dissertation. Selforganizing Systems—1962.. “Single Stage Threshold Logic. Harvard University. University of Cambridge. 1990] Widrow.) . NJ.” in Proc. [Widrow. pp. September. Report MLI902. S. vol.. pp. [Widrow & Stearns. Madaline and Backpropagation. [Werbos. 1961. PhD Thesis. 1962. 1962. NJ: PrenticeHall. “Comparing Learning Paradigms via Diagrammatic Visualization. 1962] Winder. 1991. Learning From Delayed Rewards. 321332. Computer Systems that Learn. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph. Washington. Symp. (Also Tech. 1974. P. 1989. C. [Winder. “30 Years of Adaptive Neural Networks: Perceptron. and Goldstein (eds.” Proc. 1961] Winder. Adaptive Signal Processing. A.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue reading from where you left off, or restart the preview.