
Learning From Observations

“In which we describe agents that can improve
their behavior through diligent study of their
own experiences.”
- Artificial Intelligence: A Modern Approach

Prepared by: San Chua, Natalie Weber, Henry Kwong


Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Learning Agent

• Four Components
1. Performance Element: collection of knowledge
and procedures to decide on the next action.
E.g. walking, turning, drawing, etc.
2. Learning Element: takes in feedback from the
critic and modifies the performance element
accordingly.
Learning Agent (cont’d)

3. Critic: provides the learning element with
information on how well the agent is doing based
on a fixed performance standard.
E.g. the audience
4. Problem Generator: provides the performance
element with suggestions on new actions to take.
Designing a Learning Element

• Depends on the design of the performance element


• Four major issues
1. Which components of the performance element
to improve
2. The representation of those components
3. Available feedback
4. Prior knowledge
Components of the Performance
Element
• A direct mapping from conditions on the current state
to actions
• Information about the way the world evolves
• Information about the results of possible actions the
agent can take
• Utility information indicating the desirability of world
states
Representation

• A component may be represented using different


representation schemes
• Details of the learning algorithm will differ depending
on the representation, but the general idea is the
same
• Functions are used to describe a component
Feedback & Prior Knowledge

• Supervised learning: inputs and outputs available


• Reinforcement learning: evaluation of action
• Unsupervised learning: no hint of the correct outcome
• Background knowledge is a tremendous help in
learning
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Inductive Learning

• Key idea:
– To use specific examples to reach general
conclusions
• Given a set of examples, the system tries to
approximate the evaluation function.
• Also called Pure Inductive Inference
Recognizing Handwritten Digits

[Figure: a learning agent receiving training examples of handwritten digits.]
Recognizing Handwritten Digits

[Figure: different variations of handwritten 3’s.]
Bias

• Bias: any preference for one hypothesis over


another, beyond mere consistency with the
examples.
• Since there are almost always a large number of
possible consistent hypotheses, all learning
algorithms exhibit some sort of bias.
Example of Bias

[Figure: an ambiguous handwritten digit.]
Is this a 7 or a 1? Some learning
algorithms may be more biased toward 7,
and others more biased toward 1.
Formal Definitions
• Example: a pair (x, f(x)), where
– x is the input,
– f(x) is the output of the function
applied to x.
• Hypothesis: a function h that approximates f, given a
set of examples.
Task of Induction

• The task of induction: given a set of examples, find
a function h that approximates the true evaluation
function f.
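A minimal illustration in Python (the data points and candidate functions here are made up for illustration): induction amounts to checking candidate hypotheses for consistency with the (x, f(x)) pairs.

examples = [(0, 0), (1, 1), (2, 4), (3, 9)]   # (x, f(x)) pairs from an unknown f

candidates = {
    "h1(x) = x^2":    lambda x: x ** 2,
    "h2(x) = 3x - 2": lambda x: 3 * x - 2,
}

for name, h in candidates.items():
    # h is consistent if it reproduces f(x) on every example
    ok = all(h(x) == y for x, y in examples)
    print(name, "is consistent" if ok else "is inconsistent")

Many other functions are also consistent with these four examples; which consistent h a learner prefers is exactly its bias.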
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Decision Tree Example

Goal predicate: Will wait for a table?

Patrons?
  none  -> No
  some  -> Yes
  full  -> WaitEst?
             >60   -> No
             30-60 -> Alternate?
                        no  -> Reservation?
                                 no  -> No
                                 yes -> Yes
                        yes -> Fri/Sat?
                                 no  -> No
                                 yes -> Yes
             10-30 -> Hungry?
                        no  -> No
                        yes -> Yes
             0-10  -> Yes

http://www.cs.washington.edu/education/courses/473/99wi/
Logical Representation of a Path

The path Patrons? = full, WaitEst? = 10-30, Hungry? = yes ends in a Yes leaf,
and is represented logically as:

∀r [Patrons(r, full) ∧ Wait_Estimate(r, 10-30) ∧ Hungry(r, yes)] ⇒ Will_Wait(r)
Expressiveness of Decision Trees
• Any Boolean function can be written as a decision tree
• Limitations
– Can only describe one object at a time.
– Some functions require an exponentially large decision tree.
• E.g. the parity and majority functions
• Decision trees are good for some kinds of functions, and bad for
others.
• There is no one efficient representation for all kinds of
functions.
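To see why parity is a worst case for decision trees, note that flipping any single input bit flips the output, so no single attribute test yields purer subsets and a consistent tree must test every attribute along every path. A quick sketch for n = 3 (the encoding is just for illustration):

from itertools import product

def parity(bits):
    # 1 if an odd number of bits are set, else 0
    return sum(bits) % 2

# each of the 2**3 input combinations needs its own leaf, since
# examples differing in a single bit always have different labels
for bits in product([0, 1], repeat=3):
    print(bits, "->", parity(bits))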
Principle Behind the Decision-Tree-Learning
Algorithm
• Uses a general principle of inductive learning often
called Ockham’s razor:
“The most likely hypothesis is the simplest one that
is consistent with all observations.”
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Decision-Tree-Learning Algorithm

• Goal: find a relatively small decision tree that is
consistent with all training examples and will
correctly classify new examples.
• Note that finding the smallest decision tree is an
intractable problem, so the decision-tree-learning
algorithm uses some simple heuristics to find a
“smallish” one.
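A minimal sketch of such an algorithm in Python, in the spirit of AIMA’s DECISION-TREE-LEARNING. The (attribute-dictionary, label) example format and the use of information gain (anticipating the “good attribute” discussion below) are assumptions of this sketch, not details given on the slides:

import math
from collections import Counter

def split(examples, attr):
    # group examples by their value for attr
    groups = {}
    for x, label in examples:
        groups.setdefault(x[attr], []).append((x, label))
    return groups

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr):
    total = len(examples)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in split(examples, attr).values())
    return entropy(examples) - remainder

def majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:
        return default                      # no examples left: fall back
    if len({label for _, label in examples}) == 1:
        return examples[0][1]               # all examples agree: leaf
    if not attributes:
        return majority(examples)           # attributes exhausted: majority leaf
    best = max(attributes, key=lambda a: info_gain(examples, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: dtl(subset, rest, majority(examples))
                   for value, subset in split(examples, best).items()}}

Greedily choosing the attribute with the highest information gain is the kind of “simple heuristic” that tends to keep the tree small.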
Getting Started

• Come up with a set of attributes to describe the
object or situation.
• Collect a complete set of examples (the training set)
from which the algorithm can derive a hypothesis
that defines (answers) the goal predicate.
The Restaurant Domain

Example  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1       No   Yes  Some  $$$    No    Yes  French   0-10   Yes
X2       No   Yes  Full  $      No    No   Thai     30-60  No
X3       No   No   Some  $      No    No   Burger   0-10   Yes
X4       Yes  Yes  Full  $      No    No   Thai     10-30  Yes
X5       Yes  No   Full  $$$    No    Yes  French   >60    No
X6       No   Yes  Some  $$     Yes   Yes  Italian  0-10   Yes
X7       No   No   None  $      Yes   No   Burger   0-10   No
X8       No   Yes  Some  $$     Yes   Yes  Thai     0-10   Yes
X9       Yes  No   Full  $      Yes   No   Burger   >60    No
X10      Yes  Yes  Full  $$$    No    Yes  Italian  10-30  No
X11      No   No   None  $      No    No   Thai     0-10   No
X12      Yes  Yes  Full  $      No    No   Burger   30-60  Yes

(Fri through Est are the attributes; WillWait is the goal.)

Will we wait, or not?

http://www.cs.washington.edu/education/courses/473/99wi/
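For experimentation, the twelve examples can be transcribed into the (attribute-dictionary, label) form used by the sketch above; the encoding below is an assumption for illustration:

ATTRS = ["Fri", "Hun", "Pat", "Price", "Rain", "Res", "Type", "Est"]
ROWS = [
    ("No",  "Yes", "Some", "$$$", "No",  "Yes", "French",  "0-10",  "Yes"),  # X1
    ("No",  "Yes", "Full", "$",   "No",  "No",  "Thai",    "30-60", "No"),   # X2
    ("No",  "No",  "Some", "$",   "No",  "No",  "Burger",  "0-10",  "Yes"),  # X3
    ("Yes", "Yes", "Full", "$",   "No",  "No",  "Thai",    "10-30", "Yes"),  # X4
    ("Yes", "No",  "Full", "$$$", "No",  "Yes", "French",  ">60",   "No"),   # X5
    ("No",  "Yes", "Some", "$$",  "Yes", "Yes", "Italian", "0-10",  "Yes"),  # X6
    ("No",  "No",  "None", "$",   "Yes", "No",  "Burger",  "0-10",  "No"),   # X7
    ("No",  "Yes", "Some", "$$",  "Yes", "Yes", "Thai",    "0-10",  "Yes"),  # X8
    ("Yes", "No",  "Full", "$",   "Yes", "No",  "Burger",  ">60",   "No"),   # X9
    ("Yes", "Yes", "Full", "$$$", "No",  "Yes", "Italian", "10-30", "No"),   # X10
    ("No",  "No",  "None", "$",   "No",  "No",  "Thai",    "0-10",  "No"),   # X11
    ("Yes", "Yes", "Full", "$",   "No",  "No",  "Burger",  "30-60", "Yes"),  # X12
]
# each example: ({attribute: value}, WillWait label)
restaurant_examples = [(dict(zip(ATTRS, row[:-1])), row[-1]) for row in ROWS]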
Splitting Examples by Testing on
Attributes
+ X1, X3, X4, X6, X8, X12 (Positive examples)
- X2, X5, X7, X9, X10, X11 (Negative examples)
Splitting Examples by Testing on
Attributes (cont’d)
+ X1, X3, X4, X6, X8, X12 (positive examples)
- X2, X5, X7, X9, X10, X11 (negative examples)

Patrons?
  none: +{}                -{X7, X11}
  some: +{X1, X3, X6, X8}  -{}
  full: +{X4, X12}         -{X2, X5, X9, X10}
Splitting Examples by Testing on
Attributes (cont’d)
+ X1, X3, X4, X6, X8, X12 (positive examples)
- X2, X5, X7, X9, X10, X11 (negative examples)

Patrons?
  none: +{}                -{X7, X11}          -> No
  some: +{X1, X3, X6, X8}  -{}                 -> Yes
  full: +{X4, X12}         -{X2, X5, X9, X10}
Splitting Examples by Testing on Attributes
(cont’d)
+ X1, X3, X4, X6, X8, X12 (positive examples)
- X2, X5, X7, X9, X10, X11 (negative examples)

Patrons?
  none: +{}                -{X7, X11}          -> No
  some: +{X1, X3, X6, X8}  -{}                 -> Yes
  full: +{X4, X12}         -{X2, X5, X9, X10}  -> Hungry?
          no:  +{}         -{X5, X9}
          yes: +{X4, X12}  -{X2, X10}
What Makes a Good Attribute?

Patrons? is the better attribute:
  none: +{}                -{X7, X11}
  some: +{X1, X3, X6, X8}  -{}
  full: +{X4, X12}         -{X2, X5, X9, X10}
(two of the three branches are already uniformly classified)

Type? is not as good an attribute:
  French:  +{X1}       -{X5}
  Italian: +{X6}       -{X10}
  Thai:    +{X4, X8}   -{X2, X11}
  Burger:  +{X3, X12}  -{X7, X9}
(every branch is still an even mix of positives and negatives)

http://www.cs.washington.edu/education/courses/473/99wi/
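The intuition can be made quantitative with information gain, as the decision-tree-learning sketch above does. A worked check (an addition, using the standard Boolean entropy formula; the positive/negative counts come straight from the splits above):

import math

def H(p):
    # entropy (in bits) of a Boolean distribution with positive-probability p
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gain(branches, total=12):
    # branches: (positives, negatives) per attribute value; 6+/6- overall
    remainder = sum((p + n) / total * H(p / (p + n)) for p, n in branches)
    return H(6 / 12) - remainder

print(gain([(0, 2), (4, 0), (2, 4)]))           # Patrons? -> about 0.541 bits
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))   # Type?    -> 0.0 bits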
Final Decision Tree

Patrons?
  none -> No
  some -> Yes
  full -> Hungry?
            No  -> No
            Yes -> Type?
                     French  -> Yes
                     Italian -> No
                     Thai    -> Fri/Sat?
                                  no  -> No
                                  yes -> Yes
                     burger  -> Yes

http://www.cs.washington.edu/education/courses/473/99wi/
Original Decision Tree Example

Goal predicate: Will wait for a table?

Patrons?
  none  -> No
  some  -> Yes
  full  -> WaitEst?
             >60   -> No
             30-60 -> Alternate?
                        no  -> Reservation?
                                 no  -> No
                                 yes -> Yes
                        yes -> Fri/Sat?
                                 no  -> No
                                 yes -> Yes
             10-30 -> Hungry?
                        no  -> No
                        yes -> Yes
             0-10  -> Yes

http://www.cs.washington.edu/education/courses/473/99wi/
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Assessing the Performance of the
Learning Algorithm
• A learning algorithm is good if it produces
hypotheses that do a good job of predicting the
classifications of unseen examples.
• Test the algorithm’s prediction performance on a set
of new examples, called a test set.
Methodology for Assessing
Performance
1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the test set. It
is very important that these two sets are kept separate so that the
algorithm cannot cheat. Usually this division of examples is
done randomly.
3. Use the learning algorithm with the training set as examples to
generate a hypothesis H.
Methodology (cont’d)

4. Measure the percentage of examples in the test set
that are correctly classified by H.
5. Repeat steps 1 to 4 for different sizes of training sets
and different randomly selected training sets of
each size.
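A sketch of steps 2 to 4 in Python; learn and predict are placeholders for whatever learning algorithm is being assessed:

import random

def holdout_accuracy(examples, train_size, learn, predict, seed=None):
    # step 2: split randomly into disjoint training and test sets
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    train, test = shuffled[:train_size], shuffled[train_size:]
    # step 3: learn a hypothesis H from the training set only
    h = learn(train)
    # step 4: measure the fraction of test examples H classifies correctly
    return sum(predict(h, x) == y for x, y in test) / len(test)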
Analyzing the Results

Learning Curve for the Decision Tree Algorithm
(on examples in the restaurant domain)

[Figure: % correct on the test set (y-axis, 0.4 to 1.0) plotted against
training set size (x-axis, 0 to 100). Accuracy climbs steadily as the
training set grows: a “happy graph”.]

“Artificial Intelligence: A Modern Approach”, Stuart Russell & Peter Norvig
Overfitting
• Overfitting is what happens when a learning algorithm finds
meaningless “regularity” in the data.
• Caused by irrelevant attributes.
• Solution: decision tree pruning.
– The resulting decision tree is:
• smaller,
• more tolerant of noise, and
• more accurate in its predictions.
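One simple pruning scheme is reduced-error pruning against a held-out validation set; this is a hedged sketch of that idea (not necessarily the pruning method the slides have in mind), reusing the nested-dict tree format from the earlier sketch:

from collections import Counter

def classify(tree, x, default):
    # walk the nested-dict tree until a leaf label is reached
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(x[attr], default)
    return tree

def accuracy(tree, examples, default):
    return sum(classify(tree, x, default) == y for x, y in examples) / len(examples)

def prune(tree, validation, default):
    if not isinstance(tree, dict):
        return tree                          # already a leaf
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        tree[attr][value] = prune(subtree, validation, default)
    # candidate leaf: validation-set majority label (a simplification)
    leaf = Counter(y for _, y in validation).most_common(1)[0][0]
    if accuracy(leaf, validation, default) >= accuracy(tree, validation, default):
        return leaf                          # the leaf does at least as well: prune
    return tree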
Practical Uses of Decision Tree Learning
• Designing oil platform equipment.
• Learning to fly a plane.
• Diagnosing heart attacks.
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Learning General Logical Descriptions

• Key idea:
– Look at inductive learning generally
– Find a logical description that is equivalent
to the (unknown) evaluation function
• Make our hypothesis more or less specific to match
the evaluation function.
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Accessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Current-best-hypothesis Search

• Key idea:
– Maintain a single hypothesis throughout.
– Update the hypothesis to maintain consistency as
a new example comes in.
Definitions

• Positive example: an instance of the hypothesis


• Negative example: not an instance of the hypothesis
• False negative example: the hypothesis predicts it
should be negative but it is in fact positive.
• False positive example: the hypothesis predicts it
should be positive but it is in fact negative.
Current-best-hypothesis Search
Algorithm
1. Pick a random example to define the initial
hypothesis
2. For each example,
– In case of a false negative:
• Generalize the hypothesis to include it
– In case of a false positive:
• Specialize the hypothesis to exclude it
3. Return the hypothesis
How to Generalize

a) Replacing Constants with Variables:
Object(Animal, Bird) → Object(X, Bird)
b) Dropping Conjuncts:
Object(Animal, Bird) & Feature(Animal, Wings)
→ Object(Animal, Bird)
c) Adding Disjuncts:
Feature(Animal, Feathers) → Feature(Animal, Feathers) v
Feature(Animal, Fly)
d) Generalizing Terms:
Feature(Bird, Wings) → Feature(Bird, Primary-Feature)

http://www.pitt.edu/~suthers/infsci1054/8.html
How to Specialize

a) Replacing Variables with Constants:
Object(X, Bird) → Object(Animal, Bird)
b) Adding Conjuncts:
Object(Animal, Bird) → Object(Animal, Bird) &
Feature(Animal, Wings)
c) Dropping Disjuncts:
Feature(Animal, Feathers) v Feature(Animal, Fly)
→ Feature(Animal, Fly)
d) Specializing Terms:
Feature(Bird, Primary-Feature) → Feature(Bird, Wings)

http://www.pitt.edu/~suthers/infsci1054/8.html
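Putting the search loop and these operations together: a minimal Python sketch, assuming a hypothesis is a conjunction of required attribute values stored as a dict. The encoding and the naive choice of which conjunct to add are assumptions of this sketch; a full implementation must also re-check all earlier examples and backtrack when no consistent modification exists:

def covers(h, x):
    # the conjunction of attribute = value tests predicts positive for x
    return all(x.get(a) == v for a, v in h.items())

def current_best(examples):
    (x0, y0) = examples[0]          # assume the first example is positive
    h = dict(x0)                    # initial hypothesis: exactly the first example
    for x, y in examples[1:]:
        predicted = covers(h, x)
        if y and not predicted:
            # false negative: generalize by dropping conflicting conjuncts
            h = {a: v for a, v in h.items() if x.get(a) == v}
        elif not y and predicted:
            # false positive: specialize by adding a conjunct that excludes x
            for a, v in x0.items():
                if a not in h and x.get(a) != v:
                    h[a] = v
                    break
    return h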
What do all these mean?

• Let’s look at some examples...


Generalize and Specialize

• Must be consistent with all other examples


• Non-deterministic
– At any point there may be several possible
specializations or generalizations that can be
applied.
Potential Problems of Current-best-
hypothesis Search
• The extensions made do not necessarily lead to the
simplest hypothesis.
• May lead to an unrecoverable situation where no
simple modification of the hypothesis is consistent
with all of the examples.
• The program must backtrack to a previous choice
point.
Problem of Backtracking

• Requires a large amount of space to store all the examples
• Needs to check all previous instances after each
modification of the hypothesis
• Searching and re-checking all these previous instances
after each modification is very expensive
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Version Space Learning Algorithm

• Least-Commitment Search
• No backtracking
• Key idea:
– Maintain the most general and specific hypotheses
at any point in learning. Update them as new
examples come in.
Definitions

• Version space: the set of all hypotheses consistent
with the examples seen so far
• Boundary sets: sets of hypotheses that bound the
region of hypotheses consistent with the examples
– the most general (G-set) and most specific (S-set)
boundary sets
Requirement

• Recording all consistent hypotheses explicitly would
usually require an enormous number of them.
• An assumption: a partial ordering (the more-specific-than
ordering) exists on all of the hypotheses in the space,
making it hierarchical.
• Boundary sets circumscribe the space of possible
hypotheses:
– the G-set (the most general boundary)
– the S-set (the most specific boundary)
Version Space Learning Algorithm

1. Initially, the G-set contains just True, and the S-set
contains just False.
2. For each new example, there are 6 possible cases:
a) false positive for Si in S
• Si is too general, but it has no consistent specializations.
• Throw it out.
b) false negative for Si in S
• Si is too specific.
• Replace it with its generalizations.
Version Space Learning Algorithm
(cont’d)
c) false positive for Gi in G
• Gi is too general.
• Replace it with its specializations.
d) false negative for Gi in G
• Gi is too specific - no consistent generalizations.
• Throw it out.
e) Si more general than some other hypothesis in S or G
• Throw it out.
f) Gi more specific than some other hypothesis in S or G
• Throw it out.
Version Space Learning Algorithm
(cont’d)
3. Repeat the process until one of three things
happens:
a) Only one hypothesis left in the version space.
• This is the answer we want.
b) The version space collapses, i.e. either G or S
becomes empty.
• This means there are no consistent hypotheses.
c) We run out of examples while the version space
still has several hypotheses.
• Use their collective evaluation (breaking disagreements
with majority vote).
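A compact sketch of the boundary-set updates for conjunctive hypotheses, encoded as {attribute: value or '?'} where '?' matches anything. Initializing S from the first positive example instead of False, and omitting cases (e) and (f), are simplifications of this sketch:

def covers(h, x):
    return all(v == "?" or x[a] == v for a, v in h.items())

def more_general(h1, h2):
    return all(v == "?" or v == h2[a] for a, v in h1.items())

def generalize(s, x):
    # case (b): wildcard every conjunct the positive example violates
    return {a: (v if v == x[a] else "?") for a, v in s.items()}

def specializations(g, x, domains, S):
    # case (c): minimally specialize g so it no longer covers the negative x
    out = []
    for a, v in g.items():
        if v == "?":
            for u in domains[a]:
                if u != x[a]:
                    g2 = dict(g)
                    g2[a] = u
                    if any(more_general(g2, s) for s in S):
                        out.append(g2)
    return out

def version_space(examples, domains):
    (x0, _) = examples[0]                    # assume the first example is positive
    S = [dict(x0)]                           # most specific boundary
    G = [{a: "?" for a in domains}]          # most general boundary
    for x, y in examples[1:]:
        if y:
            G = [g for g in G if covers(g, x)]                 # case (d)
            S = [generalize(s, x) if not covers(s, x) else s   # case (b)
                 for s in S]
        else:
            S = [s for s in S if not covers(s, x)]             # case (a)
            G = [g2 for g in G for g2 in
                 (specializations(g, x, domains, S) if covers(g, x) else [g])]
    return S, G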
Advantages of the Algorithm

• Never favors one possible hypothesis over another; all
remaining hypotheses are consistent
• Never requires backtracking
Potential Problems

• Does not deal with noise
– Not very practical for real-world learning problems
• If unlimited disjunction is allowed in the hypothesis space:
– The S-set collapses to a single most specific hypothesis
(the disjunction of the positive examples seen so far)
– The G-set collapses to a single most general hypothesis
(the negation of the disjunction of the negative examples)
Outline
• Learning agents
• Inductive learning
• Learning decision trees
– Example of a decision tree
– Decision-tree-learning algorithm
– Assessing the performance
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
• Summary
Why Learning Works

• Problem: How can you know whether a theory will
accurately predict the future?
OR
How can you know that a hypothesis is
close to the target function if you don’t
know what the target function is?
• Answers are provided by computational learning theory.
Computational Learning Theory

• Main principle: “any hypothesis that is seriously
wrong will almost certainly be ‘found out’ with high
probability after a small number of examples,
because it will make an incorrect prediction.”
• Assumes that the training and test sets are drawn
randomly from the same distribution.
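This principle can be made precise: for a finite hypothesis space H, the standard PAC result says a hypothesis consistent with m examples has error at most ε with probability at least 1 − δ, provided m ≥ (1/ε)(ln(1/δ) + ln|H|). A small calculator for the bound:

import math

def pac_sample_bound(epsilon, delta, hypothesis_count):
    # examples sufficient so a consistent hypothesis is "probably
    # (prob. >= 1 - delta) approximately (error <= epsilon) correct"
    return math.ceil((math.log(1 / delta) + math.log(hypothesis_count)) / epsilon)

# e.g. all Boolean functions of 5 attributes: |H| = 2 ** (2 ** 5)
print(pac_sample_bound(0.1, 0.05, 2 ** (2 ** 5)))   # -> 252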
Summary
• Learning agents
• Inductive learning
• Learning decision trees
• Learning general logical descriptions
– Current-best hypothesis search algorithm
– Version space learning algorithm
• Computational learning theory
References

• Russell, S. and P. Norvig (1995). Artificial Intelligence: A
Modern Approach. Upper Saddle River, NJ: Prentice Hall.
• http://www.pitt.edu/~suthers/infsci1054/8.html
• http://enuxsa.eas.asu.edu/~cse471/4-20
