
CS 240
Learning (Machine Learning) & Planning

Intro. to ML, Emdad Khan
Machine Learning: Short Introduction
• Learning Agent: An agent that improves its
performance through learning over time (as you have
seen in the first few lectures).

• Learning is very important for most Agents.

  Accordingly, a new field called “Machine Learning” was born.

• What is Machine Learning?


– Basically learning the behavior, features or characteristics
from the data itself, instead of writing algorithms
directly that can model the data.



Machine Learning: Short Introduction

– Machine learning, a branch of artificial intelligence, is a scientific
discipline concerned with the design and development of algorithms
that take as input empirical data, such as that from
sensors or databases, and yield patterns or predictions
thought to be features of the underlying mechanism that
generated the data.



Machine Learning: Short Introduction
• Why Machine Learning?
– In this Information Age, many applications are highly data driven
with large amount of data – NLU, Imaging, Medical Applications,
Mining, Weather, Information Retrieval, Search….

– If Data had a Mass, our earth would be so heavy that it would


become a Black Hole!

– Very much needed when we have a large amount of data that is very
hard (or sometimes impossible) to model by writing
algorithms directly.

– NOTE: We are still writing algorithms, BUT algorithms for HOW to LEARN
from data, instead of observing the data and coming up with algorithms
to model it (or writing algorithms by observing system behavior).



Machine Learning: Short Introduction
• Types of Machine Learning
– Supervised Learning : a training set of examples with correct
response.

– Unsupervised Learning: a training set of examples without any


correct response (inputs having something in common are
categorized together)

– Reinforcement Learning: (between Supervised and Unsupervised) The
algorithm is told when the answer is wrong but is NOT told how
to correct it. Also known as “Learning with a Critic”.

– Evolutionary Learning: (based on biological evolution) Biological
organisms adapt to improve their survival rates and chance of having
offspring in their environment.



Machine Learning: Short Introduction
• Types of Algorithms (High Level)
– Statistical

– Non-statistical (Neural Networks, e.g. Back Propagation: finding
optimal value(s) by solving some differential equations)

– Evolutionary Learning / Genetic Algorithms

The rest of the slides show more details; they are from “Artificial
Intelligence” by David Poole et al., the second textbook.



Learning

Learning is the ability to improve one’s behavior based on


experience.
The range of behaviors is expanded: the agent can do
more.
The accuracy on tasks is improved: the agent can do
things better.
The speed is improved: the agent can do things faster.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 1
Components of a learning problem

The following components are part of any learning problem:


task The behavior or task that’s being improved.
For example: classification, acting in an environment
data The experiences that are being used to improve
performance in the task.
measure of improvement How can the improvement be
measured?
For example: increasing accuracy in prediction, new skills
that were not present initially, improved speed.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 2
Learning task / Learning architecture

[Figure: Experiences/Data and Background knowledge/Bias feed the Learner, which produces Model(s); the Reasoner uses the Model(s) together with a Problem/Task to produce an Answer/Performance.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Pages 3-4
Choosing a representation

The richer the representation, the more useful it is for


subsequent problem solving.
The richer the representation, the more difficult it is to
learn.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 5
Common Learning Tasks

Supervised classification Given a set of pre-classified


training examples, classify a new instance.
Unsupervised learning Find natural classes for examples.
Reinforcement learning Determine what to do based on
rewards and punishments.
Analytic learning Reason faster using experience.
Inductive logic programming Build richer models in
terms of logic programs.
Statistical relational learning learning relational
representations that also deal with uncertainty.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 6
Example Classification Data

Training Examples:
     Action  Author   Thread  Length  Where
e1   skips   known    new     long    home
e2   reads   unknown  new     short   work
e3   skips   unknown  old     long    work
e4   skips   known    old     long    home
e5   reads   known    new     short   home
e6   skips   known    old     long    work
New Examples:
e7   ???     known    new     short   work
e8   ???     unknown  new     short   work

We want to classify new examples on feature Action based on
the examples’ Author, Thread, Length, and Where.
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 7
Feedback

Learning tasks can be characterized by the feedback given to


the learner.
Supervised learning What has to be learned is specified
for each example.
Unsupervised learning No classifications are given; the
learner has to discover categories and regularities in the
data.
Reinforcement learning Feedback occurs after a
sequence of actions.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 8
Measuring Success

The measure of success is not how well the agent


performs on the training examples, but how well the
agent performs for new examples.
Consider two agents:
◮ P claims the negative examples seen are the only
negative examples. Every other instance is positive.
◮ N claims the positive examples seen are the only
positive examples. Every other instance is negative.
Both agents correctly classify every training example, but
disagree on every other example.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 9
Bias

The tendency to prefer one hypothesis over another is


called a bias.
Saying a hypothesis is better than N’s or P’s hypothesis
isn’t something that’s obtained from the data.
To have any inductive process make predictions on unseen
data, you need a bias.
What constitutes a good bias is an empirical question
about which biases work best in practice.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 10
Learning as search

Given a representation and a bias, the problem of learning


can be reduced to one of search.
Learning is search through the space of possible
representations, looking for the representation or
representations that best fit the data, given the bias.
These search spaces are typically prohibitively large for
systematic search. E.g., use gradient descent.
A learning algorithm is made of a search space, an
evaluation function, and a search method.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 11
Noise

Data isn’t perfect:


◮ some of the attributes are assigned the wrong value
◮ the attributes given are inadequate to predict the
classification
◮ there are examples with missing attributes
overfitting occurs when distinctions appear in the
training data, but not in the unseen examples.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 12
Errors in learning

Errors in learning are caused by:


Limited representation (representation bias)
Limited search (search bias)
Limited data (variance)
Limited features (noise)

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 13
Characterizations of Learning

Find the best representation given the data.


Delineate the class of consistent representations given the
data.
Find a probability distribution of the representations given
the data.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 14
Supervised Learning

Given:
a set of input features X1 , . . . , Xn
a set of target features Y1 , . . . , Yk
a set of training examples where the values for the input
features and the target features are given for each
example
a new example, where only the values for the input
features are given
predict the values for the target features for the new example.
classification when the Yi are discrete
regression when the Yi are continuous

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 2
Example Data Representations
A travel agent wants to predict the preferred length of a trip,
which can be from 1 to 6 days. (No input features).
Two representations of the same data:
— Y is the length of trip chosen.
— Each Yi is an indicator variable that has value 1 if the
chosen length is i, and is 0 otherwise.
Example  Y        Example  Y1  Y2  Y3  Y4  Y5  Y6
e1       1        e1       1   0   0   0   0   0
e2       6        e2       0   0   0   0   0   1
e3       6        e3       0   0   0   0   0   1
e4       2        e4       0   1   0   0   0   0
e5       1        e5       1   0   0   0   0   0
What is a prediction?
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 3
Evaluating Predictions

Suppose we want to make a prediction of a value for a target


feature on example e:
o_e is the observed value of the target feature on example e.
p_e is the predicted value of the target feature on example e.
The error of the prediction is a measure of how close p_e
is to o_e.
There are many possible errors that could be measured.
Sometimes p_e can be a real number even though o_e can only
have a few values.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 4
Measures of error

E is the set of examples, with a single target feature. For e ∈ E,
o_e is the observed value and p_e is the predicted value:

  absolute error:         L1(E) = Σ_{e∈E} |o_e − p_e|

  sum of squares error:   L2²(E) = Σ_{e∈E} (o_e − p_e)²

  worst-case error:       L∞(E) = max_{e∈E} |o_e − p_e|

  number wrong:           L0(E) = #{e : o_e ≠ p_e}

A cost-based error takes into account costs of errors.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 5-9
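To make the error measures concrete, here is a small Python sketch (not from the slides; the data values are made up for illustration) that computes each measure for a list of observed/predicted pairs:

# Computing the error measures above for observed (o_e) and predicted (p_e) values.
observed  = [0, 1, 1, 0, 1]
predicted = [0.2, 0.8, 0.6, 0.4, 1.0]

abs_error  = sum(abs(o - p) for o, p in zip(observed, predicted))      # L1
sq_error   = sum((o - p) ** 2 for o, p in zip(observed, predicted))    # L2 squared
worst_case = max(abs(o - p) for o, p in zip(observed, predicted))      # L-infinity
num_wrong  = sum(o != round(p) for o, p in zip(observed, predicted))   # L0, thresholding p at 0.5

print(abs_error, sq_error, worst_case, num_wrong)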
Measures of error (cont.)

With a binary feature, o_e ∈ {0, 1}:

  likelihood of the data:  Π_{e∈E} p_e^{o_e} (1 − p_e)^{(1 − o_e)}

  entropy (number of bits to encode the data given a code based on pval):
    − Σ_{e∈E} (o_e log p_e + (1 − o_e) log(1 − p_e))

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 10-11
Information theory overview

A bit is a binary digit.


1 bit can distinguish 2 items
k bits can distinguish 2^k items
n items can be distinguished using log₂ n bits
Can we do better?

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 12
Information and Probability
Let’s design a code to distinguish elements of {a, b, c, d} with

  P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/8

Consider the code:

  a 0   b 10   c 110   d 111

This code sometimes uses 1 bit, sometimes 2 bits, and sometimes 3 bits.
On average, it uses

  P(a) × 1 + P(b) × 2 + P(c) × 3 + P(d) × 3
  = 1/2 + 2/4 + 3/8 + 3/8 = 1 3/4 bits.

The string aacabbda has code 00110010101110.
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 13
Information Content
To identify x, we need − log₂ P(x) bits.
Given a distribution over a set, to identify a member, the
expected number of bits

  − Σ_x P(x) × log₂ P(x)

is the information content or entropy of the distribution.
The expected number of bits it takes to describe a
distribution given evidence e:

  I(e) = − Σ_x P(x|e) × log₂ P(x|e).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 14
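As a quick numerical check (a Python sketch, not from the slides): the entropy of the distribution P(a)=1/2, P(b)=1/4, P(c)=1/8, P(d)=1/8 from the coding example is 1.75 bits, matching the average code length computed above.

import math

def entropy(probs):
    """Information content of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75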
Information Gain

Given a test that can distinguish the cases where α is true


from the cases where α is false, the information gain from
this test is:

I (true) − (P(α) × I (α) + P(¬α) × I (¬α)).

I (true) is the expected number of bits needed before the


test
P(α) × I (α) + P(¬α) × I (¬α) is the expected number of
bits after the test.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 15
Linear Predictions

[Figure: a small data set with fitted lines labeled L∞, L2², and L1, showing that different error measures lead to different “best” linear predictions.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 16-17
Point Estimates
To make a single prediction for feature Y , with examples E :
The prediction that minimizes the sum of squares error on
E is the mean (average) value of Y .
The prediction that minimizes the absolute error on E is
the median value of Y .
The prediction that minimizes the number wrong on E is
the mode of Y .
The prediction that minimizes the worst-case error on E
is (maximum + minimum)/2.
When Y has domain {0, 1}, the prediction that
maximizes the likelihood on E is the empirical probability.
When Y has domain {0, 1}, the prediction that minimizes
the entropy on E is the empirical probability.
But that doesn’t mean that these predictions minimize the
error for future predictions....
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 18-30
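These claims are easy to check numerically. A brute-force Python sketch (not from the slides), using the trip-length data from the earlier representation example:

ys = [1, 6, 6, 2, 1]                       # trip lengths chosen in examples e1..e5
candidates = [i / 100 for i in range(0, 701)]

best_sq    = min(candidates, key=lambda p: sum((y - p) ** 2 for y in ys))
best_abs   = min(candidates, key=lambda p: sum(abs(y - p) for y in ys))
best_worst = min(candidates, key=lambda p: max(abs(y - p) for y in ys))

print(best_sq)     # 3.2, the mean of ys
print(best_abs)    # 2.0, the median of ys
print(best_worst)  # 3.5, (maximum + minimum) / 2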
Training and Test Sets

To evaluate how well a learner will work on future predictions,


we divide the examples into:
training examples that are used to train the learner
test examples that are used to evaluate the learner
...these must be kept separate.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 31
Learning Probabilities
Empirical probabilities do not make good predictors when
evaluated by likelihood or entropy.
Why? A probability of zero means “impossible” and has
infinite cost if there is one true case in the test set.
Solution: (Laplace smoothing) add (non-negative)
pseudo-counts to the data.
Suppose n_i is the number of examples with X = v_i, and
c_i is the pseudo-count:

  P(X = v_i) = (c_i + n_i) / Σ_{i′} (c_{i′} + n_{i′})

Pseudo-counts convey prior knowledge. Consider: “how
much more would I believe v_i if I had seen one example
with v_i true than if I had seen no examples with v_i true?”
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 32-34
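A minimal Python sketch of the pseudo-count estimate (not from the slides), assuming an equal pseudo-count for every value in the domain:

from collections import Counter

def smoothed_probs(values, domain, pseudo_count=1):
    """P(X = v_i) = (c_i + n_i) / sum over i' of (c_i' + n_i')."""
    counts = Counter(values)
    total = sum(counts[v] + pseudo_count for v in domain)
    return {v: (counts[v] + pseudo_count) / total for v in domain}

# "reads" never occurs in the data, but smoothing keeps its probability above zero.
print(smoothed_probs(["skips", "skips", "skips"], domain=["skips", "reads"]))
# {'skips': 0.8, 'reads': 0.2}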
Basic Models for Supervised Learning

Many learning algorithms can be seen as deriving from:


decision trees
linear (and non-linear) classifiers
Bayesian classifiers

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 1
Learning Decision Trees

Representation is a decision tree.


Bias is towards simple decision trees.
Search through the space of decision trees, from simple
decision trees to more complex ones.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 2
Decision trees

A decision tree (for a particular output feature) is a tree


where:
Each nonleaf node is labeled with an input feature.
The arcs out of a node labeled with feature A are labeled
with each possible value of the feature A.
The leaves of the tree are labeled with a point prediction of
the output feature.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 3
Example Decision Trees

[Figure: two decision trees for the Action feature.
Tree 1: split on Length; long → skips; short → split on Thread; new → reads; follow_up → split on Author; unknown → skips, known → reads.
Tree 2: split on Length; long → skips; short → reads with probability 0.82.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 4
Equivalent Logic Program

skips ← long .
reads ← short ∧ new .
reads ← short ∧ follow up ∧ known.
skips ← short ∧ follow up ∧ unknown.

or with negation as failure:

reads ← short ∧ new .


reads ← short ∧ ∼new ∧ known.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 5
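The same tree can also be written directly as nested conditionals; a Python sketch (function and argument names are illustrative) equivalent to the logic program above:

def predict_action(length, thread, author):
    """The first example decision tree, written as nested conditionals."""
    if length == "long":
        return "skips"
    # length is short
    if thread == "new":
        return "reads"
    # thread is a follow_up
    return "reads" if author == "known" else "skips"

print(predict_action("short", "follow_up", "unknown"))  # skips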
Issues in decision-tree learning

Given some training examples, which decision tree should


be generated?
A decision tree can represent any discrete function of the
input features.
You need a bias. For example, prefer the smallest tree.
Least depth? Fewest nodes? Which trees are the best
predictors of unseen data?
How should you go about building a decision tree? The
space of decision trees is too big for systematic search for
the smallest decision tree.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 6
Searching for a Good Decision Tree

The input is a set of input features, a target feature and,


a set of training examples.
Either:
◮ Stop and return a value for the target feature or a
distribution over target feature values
◮ Choose an input feature to split on.
For each value of this feature, build a subtree for those
examples with this value for the input feature.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 7
Choices in implementing the algorithm

When to stop:
◮ no more input features
◮ all examples are classified the same
◮ too few examples to make an informative split
Which feature to select to split on isn’t defined. Often we
use a myopic split: which single split gives the smallest error.
With multi-valued features, we can split on all values or
split the values in half.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 8-10
Example: possible splits

[Figure: the training examples (9 skips, 9 reads) split two ways.
Split on length: long → 7 skips, 0 reads; short → 2 skips, 9 reads.
Split on thread: new → 3 skips, 7 reads; old → 6 skips, 2 reads.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 11
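One common myopic criterion is to choose the split with the lowest expected entropy (equivalently, the highest information gain). A Python sketch (not from the slides) using the (skips, reads) counts in the figure above:

import math

def entropy_of_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_entropy(partitions):
    """Expected entropy after a split, weighted by partition size."""
    n = sum(sum(part) for part in partitions)
    return sum(sum(part) / n * entropy_of_counts(part) for part in partitions)

print(split_entropy([(7, 0), (2, 9)]))   # about 0.42 bits: split on length
print(split_entropy([(3, 7), (6, 2)]))   # about 0.85 bits: split on thread
# Splitting on length gives the lower expected entropy, so it is the myopic choice.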
Handling Overfitting

This algorithm can overfit the data.


This occurs when the tree fits noise and correlations in the training
set that are not reflected in the data as a whole.
To handle overfitting:
◮ restrict the splitting, and split only when the split is
useful.
◮ allow unrestricted splitting and prune the resulting tree
where it makes unwarranted distinctions.
◮ learn multiple trees and average them.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 12
Linear Function

A linear function of features X1 , . . . , Xn is a function of the


form:

f w (X1 , . . . , Xn ) = w0 + w1 X1 + · · · + wn Xn

We invent a new feature X0 , which always has value 1, so that w0
is not a special case.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 13
Linear Regression

Linear regression is where the output is a linear function of
the input features:

  pval_w(e, Y) = w0 + w1 × val(e, X1) + · · · + wn × val(e, Xn)

The sum of squares error on examples E for output Y is:

  Error_E(w) = Σ_{e∈E} (val(e, Y) − pval_w(e, Y))²
             = Σ_{e∈E} (val(e, Y) − (w0 + w1 × val(e, X1) + · · · + wn × val(e, Xn)))²

Goal: find weights that minimize Error_E(w).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 14-15
Finding weights that minimize Error_E(w)

Find the minimum analytically.
Effective when it can be done (e.g., for linear regression).

Find the minimum iteratively.
Works for larger classes of problems.
Gradient descent:

  w_i ← w_i − η × ∂Error_E(w)/∂w_i

η is the gradient descent step size, the learning rate.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 16-17
Gradient Descent for Linear Regression
1: procedure LinearLearner (X , Y , E , η)
2: Inputs
3: X : set of input features, X = {X1 , . . . , Xn }
4: Y : output feature
5: E : set of examples from which to learn
6: η: learning rate
7: initialize w0 , . . . , wn randomly
8: repeat
9: for each example e in E do
10: δ ← val(e, Y ) − pval w (e, Y )
11: for each i ∈ [0, n] do
12: wi ← wi + ηδval(e, Xi )
13: until some stopping criterion is true
14: return w0 , . . . , wn
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 18
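A direct Python translation of the LinearLearner procedure above (a sketch; the fixed epoch count as the stopping criterion and the data format are my own assumptions):

import random

def linear_learner(examples, n_features, eta=0.01, epochs=1000):
    """examples: list of (x, y), where x is a list of n_features input values.
    An implicit feature X0 = 1 is prepended so w0 is handled like any other weight."""
    w = [random.uniform(-0.1, 0.1) for _ in range(n_features + 1)]
    for _ in range(epochs):                          # stopping criterion: fixed number of passes
        for x, y in examples:
            xs = [1.0] + list(x)
            delta = y - sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(len(w)):
                w[i] += eta * delta * xs[i]
    return w

# Learn y = 2x + 1 from four points; the returned weights approach [1.0, 2.0].
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
print(linear_learner(data, n_features=1))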
Linear Classifier

Assume we are doing binary classification, with classes


{0, 1} (e.g., using indicator functions).
There is no point in making a prediction of less than 0 or
greater than 1.
A squashed linear function is of the form:

f w (X1 , . . . , Xn ) = f (w0 + w1 X1 + · · · + wn Xn )

where f is an activation function .


A simple activation function is the step function:

  f (x) = 1 if x ≥ 0, and 0 if x < 0

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 19
Gradient Descent for Linear Classifiers
If the activation function is differentiable, we can use gradient descent
to update the weights. The sum of squares error is:

  Error_E(w) = Σ_{e∈E} (val(e, Y) − f(Σ_i w_i × val(e, X_i)))²

The partial derivative with respect to weight w_i for example e is:

  ∂Error_E(w)/∂w_i = −2 × δ × f′(Σ_i w_i × val(e, X_i)) × val(e, X_i)

where δ = val(e, Y) − pval_w(e, Y).

Thus, each example e updates each weight w_i by

  w_i ← w_i + η × δ × f′(Σ_i w_i × val(e, X_i)) × val(e, X_i)

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 20
The sigmoid or logistic activation function

[Figure: plot of the sigmoid over x ∈ [−10, 10], rising from 0 toward 1, with value 0.5 at x = 0.]

  f (x) = 1 / (1 + e^(−x))

  f ′(x) = f (x)(1 − f (x))

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 21-22
Gradient Descent for Logistic Regression
1: procedure LinearLearner (X , Y , E , η)
2: Inputs
3: X : set of input features, X = {X1 , . . . , Xn }
4: Y : output feature
5: E : set of examples from which to learn
6: η: learning rate
7: initialize w0 , . . . , wn randomly
8: repeat
9: for each example e in E do
10: p ← f (Σ_i w_i × val(e, X_i))
11: δ ← val(e, Y ) − p
12: for each i ∈ [0, n] do
13: wi ← wi + ηδp(1 − p)val(e, Xi )
14: until some stopping criterion is true
15: return w0 , . . . , wn
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 23
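The same translation for the logistic-regression learner above; the only changes are the sigmoid and the extra p(1 − p) factor in the update (again a Python sketch, with an assumed data format and stopping criterion):

import math
import random

def logistic_learner(examples, n_features, eta=0.5, epochs=2000):
    """Stochastic gradient descent for a sigmoid-squashed linear function."""
    w = [random.uniform(-0.1, 0.1) for _ in range(n_features + 1)]
    for _ in range(epochs):
        for x, y in examples:
            xs = [1.0] + list(x)                     # X0 = 1 supplies the bias weight w0
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, xs))))
            delta = y - p
            for i in range(len(w)):
                w[i] += eta * delta * p * (1 - p) * xs[i]   # f'(x) = p(1 - p)
    return w

# Learn a simple threshold: y = 1 when x is above roughly 0.5.
data = [([0.1], 0), ([0.3], 0), ([0.6], 1), ([0.9], 1)]
print(logistic_learner(data, n_features=1))   # typically w0 < 0 < w1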
Simple Example

[Figure: a squashed linear function predicting reads, with weights new → −0.7, short → −0.9, home → 1.2, and bias 0.4.]

Ex   new  short  home  Predicted         Obs  error
e1   0    0      0     f (0.4)  = 0.6    0    0.36
e2   1    1      0     f (−1.2) = 0.23   0    0.053
e3   1    0      1     f (0.9)  = 0.71   1    0.084

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 24-25
Linearly Separable

A classification is linearly separable if there is a


hyperplane where the classification is true on one side of
the hyperplane and false on the other side.
For the sigmoid function, the hyperplane is when:
w0 + w1 × val(e, X1 ) + · · · + wn × val(e, Xn ) = 0.
If the data are linearly separable, the error can be made
arbitrarily small.
[Figure: the functions or, and, and xor on inputs in {0, 1}²: or is positive everywhere except (0, 0); and is positive only at (1, 1); xor is positive at (0, 1) and (1, 0) only.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 26
Bayesian classifiers
Idea: if you knew the classification you could predict the
values of features.

  P(Class|X1 . . . Xn ) ∝ P(X1 , . . . , Xn |Class)P(Class)

Naive Bayesian classifier: the Xi are independent of each
other given the class.
Requires: P(Class) and P(Xi |Class) for each Xi .

  P(Class|X1 . . . Xn ) ∝ P(Class) × Π_i P(Xi |Class)

[Figure: naive Bayes belief network with class UserAction and child features Author, Thread, Length, Where Read.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 27
Learning Probabilities

[Figure: a table of counts is converted into a naive Bayes network with class C and children X1, X2, X3, X4.]

X1   X2   X3   X4   C   Count
...  ...  ...  ...  ..  ...
t    f    t    t    1   40
t    f    t    t    2   10
t    f    t    t    3   50
...  ...  ...  ...  ..  ...

  P(C = v_i) = Σ_{t |= C=v_i} Count(t) / Σ_t Count(t)

  P(X_k = v_j | C = v_i) = Σ_{t |= C=v_i ∧ X_k=v_j} Count(t) / Σ_{t |= C=v_i} Count(t)

...perhaps including pseudo-counts
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 28-29
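A minimal naive Bayes classifier built from such counts (a Python sketch, not the book's code; the toy data and the Laplace smoothing with pseudo-count 1 are my own choices):

from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features_dict, cls). Gathers the counts used above."""
    class_counts = Counter(cls for _, cls in examples)
    feature_counts = defaultdict(Counter)        # (feature, cls) -> Counter over values
    domains = defaultdict(set)                   # feature -> set of observed values
    for feats, cls in examples:
        for f, v in feats.items():
            feature_counts[(f, cls)][v] += 1
            domains[f].add(v)
    return class_counts, feature_counts, domains

def predict(model, feats, pseudo=1):
    class_counts, feature_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                    # P(Class)
        for f, v in feats.items():               # times P(X_i | Class), with pseudo-counts
            score *= (feature_counts[(f, cls)][v] + pseudo) / (n_cls + pseudo * len(domains[f]))
        scores[cls] = score
    return max(scores, key=scores.get)

examples = [({"author": "known", "length": "long"}, "skips"),
            ({"author": "unknown", "length": "short"}, "reads"),
            ({"author": "known", "length": "short"}, "reads")]
print(predict(train_naive_bayes(examples), {"author": "known", "length": "short"}))  # reads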
Help System

[Figure: naive Bayes network with class H (the help page) and a boolean child node for each word: "able", "absent", "add", ..., "zoom".]

The domain of H is the set of all help pages.
The observations are the words in the query.
What probabilities are needed?
What pseudo-counts and counts are used?
What data can be used to learn from?
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 30
Planning

Planning is deciding what to do based on an agent’s


ability, its goals, and the state of the world.
Planning is finding a sequence of actions to solve a goal.
Initial assumptions:
◮ The world is deterministic.
◮ There are no exogenous events outside of the control of
the robot that change the state of the world.
◮ The agent knows what state it is in.
◮ Time progresses discretely from one state to the next.
◮ Goals are predicates of states that need to be achieved
or maintained.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 1
Actions

A deterministic action is a partial function from states to


states.
The preconditions of an action specify when the action
can be carried out.
The effect of an action specifies the resulting state.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 2
Delivery Robot Example

[Figure: map of four connected locations: Coffee Shop, Sam’s Office, Mail Room, and Lab.]

Features:
  RLoc – Rob’s location
  RHC – Rob has coffee
  SWC – Sam wants coffee
  MW – Mail is waiting
  RHM – Rob has mail

Actions:
  mc – move clockwise
  mcc – move counterclockwise
  puc – pickup coffee
  dc – deliver coffee
  pum – pickup mail
  dm – deliver mail
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 3
Explicit State-space Representation

State                        Action   Resulting State
〈lab, rhc, swc, mw, rhm〉     mc       〈mr, rhc, swc, mw, rhm〉
〈lab, rhc, swc, mw, rhm〉     mcc      〈off, rhc, swc, mw, rhm〉
〈off, rhc, swc, mw, rhm〉     dm       〈off, rhc, swc, mw, rhm〉
〈off, rhc, swc, mw, rhm〉     mcc      〈cs, rhc, swc, mw, rhm〉
〈off, rhc, swc, mw, rhm〉     mc       〈lab, rhc, swc, mw, rhm〉
...                          ...      ...

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 4
Feature-based representation of actions

For each action:


precondition is a proposition that specifies when the
action can be carried out.
For each feature:
causal rules that specify when the feature gets a new
value and
frame rules that specify when the feature keeps its value.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 5
Example feature-based representation
Precondition of pick-up coffee (puc):

  RLoc = cs ∧ ¬rhc

Rules for location is cs:

  RLoc′ = cs ← RLoc = off ∧ Act = mcc
  RLoc′ = cs ← RLoc = mr ∧ Act = mc
  RLoc′ = cs ← RLoc = cs ∧ Act ≠ mcc ∧ Act ≠ mc

Rules for “robot has coffee”:

  rhc′ ← rhc ∧ Act ≠ dc
  rhc′ ← Act = puc

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 6
STRIPS Representation

For each action:


precondition that specifies when the action can be
carried out.
effect a set of assignments of values to features that are
made true by this action.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 7
Example STRIPS representation

Pick-up coffee (puc):
  precondition: [cs, ¬rhc]
  effect: [rhc]
Deliver coffee (dc):
  precondition: [off , rhc]
  effect: [¬rhc, ¬swc]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 8
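One simple way to hold STRIPS-style descriptions in code is a dictionary per action, with preconditions and effects as partial assignments of features to values; a Python sketch (the data structure and names are illustrative, not the book's):

actions = {
    "puc": {"precondition": {"RLoc": "cs", "RHC": False},
            "effect":       {"RHC": True}},
    "dc":  {"precondition": {"RLoc": "off", "RHC": True},
            "effect":       {"RHC": False, "SWC": False}},
}

def applicable(state, name):
    """An action can be carried out when its precondition holds in the state."""
    return all(state.get(f) == v for f, v in actions[name]["precondition"].items())

def apply(state, name):
    """The resulting state: the effect's assignments override the old values."""
    new_state = dict(state)
    new_state.update(actions[name]["effect"])
    return new_state

state = {"RLoc": "cs", "RHC": False, "SWC": True, "MW": True, "RHM": False}
if applicable(state, "puc"):
    print(apply(state, "puc"))   # RHC becomes True; everything else is unchanged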
Planning

Given:
A description of the effects and preconditions of the
actions
A description of the initial state
A goal to achieve
find a sequence of actions that is possible and will result in a
state satisfying the goal.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 1
Forward Planning

Idea: search in the state-space graph.


The nodes represent the states
The arcs correspond to the actions: The arcs from a state
s represent all of the actions that are legal in state s.
A plan is a path from the state representing the initial
state to a state that satisfies the goal.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 2
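A forward planner is then an ordinary graph search over states. A breadth-first Python sketch (not from the book), reusing the actions, applicable, and apply helpers from the STRIPS sketch above; with only puc and dc defined there, the example below needs just one step:

from collections import deque

def forward_plan(initial_state, goal):
    """Breadth-first forward search; returns a list of action names achieving the goal."""
    frontier = deque([(initial_state, [])])
    seen = {tuple(sorted(initial_state.items()))}
    while frontier:
        state, plan = frontier.popleft()
        if all(state.get(f) == v for f, v in goal.items()):
            return plan
        for name in actions:
            if applicable(state, name):
                nxt = apply(state, name)
                key = tuple(sorted(nxt.items()))
                if key not in seen:              # cycle check / multiple-path pruning
                    seen.add(key)
                    frontier.append((nxt, plan + [name]))
    return None

print(forward_plan({"RLoc": "off", "RHC": True, "SWC": True}, {"SWC": False}))  # ['dc']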
Example state-space graph

Actions: mc (move clockwise), mac (move anticlockwise), nm (no move), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

[Figure: part of the forward-planning state-space graph. Nodes are complete states such as 〈cs,rhc,swc,mw,rhm〉; arcs are the actions legal in each state (mc, mac, puc, dc, ...) leading to successor states.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 3
What are the errors?

[Figure: an exercise. A forward-planning state-space graph with numbered nodes (states such as 〈mr,rhc,swc,mw,rhm〉) and action arcs (mc, mac, puc, pum, ...); some of the transitions drawn are incorrect and are to be identified.]

Actions: mc (move clockwise), mac (move anticlockwise), nm (no move), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 4
Forward planning representation

The search graph can be constructed on demand: you


only construct reachable states.
If you want a cycle check or multiple path-pruning, you
need to be able to find repeated states.
There are a number of ways to represent states:
◮ As a specification of the value of every feature
◮ As a path from the start state

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 5
Improving Search Efficiency

Forward search can use domain-specific knowledge specified


as:
a heuristic function that estimates the number of steps to
the goal
domain-specific pruning of neighbors:
◮ don’t go to the coffee shop unless “Sam wants coffee” is
part of the goal and Rob doesn’t have coffee
◮ don’t pick-up coffee unless Sam wants coffee
◮ unless the goal involves time constraints, don’t do the
“no move” action.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 6
Regression Planning

Idea: search backwards from the goal description: nodes


correspond to subgoals, and arcs to actions.
Nodes are propositions: a formula made up of
assignments of values to features
Arcs correspond to actions that can achieve one of the
goals
Neighbors of a node N associated with arc A specify what
must be true immediately before A so that N is true
immediately after.
The start node is the goal to be achieved.
goal(N) is true if N is a proposition that is true of the
initial state.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 1
Defining nodes and arcs

A node N can be represented as a set of assignments of


values to variables:

[X1 = v1 , . . . , Xn = vn ]

This is a set of assignments you want to hold.


The last action is one that achieves one of the Xi = vi ,
and does not achieve Xj = vj′ where vj′ is different from vj .
The neighbor of N along arc A must contain:
◮ The prerequisites of action A
◮ All of the elements of N that were not achieved by A
N must be consistent.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 2
Formalizing arcs using STRIPS notation

  ⟨G , A, N⟩

where G is [X1 = v1 , . . . , Xn = vn ], is an arc if:

  ∃i Xi = vi is on the effects list of action A
  ∀j Xj = vj′ is not on the effects list for A, where vj′ ≠ vj
  N is preconditions(A) ∪ {Xk = vk : Xk = vk ∉ effects(A)}
  and N is consistent in that it does not assign different
  values to any variable.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 3
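A sketch of the corresponding neighbor computation for a regression planner (again using the value-style actions dictionary from the STRIPS sketch; names are illustrative):

def regress(goal, name):
    """Subgoal that must hold before action 'name' so that 'goal' holds after it,
    or None if the action cannot be the last action for this goal."""
    effects = actions[name]["effect"]
    precond = actions[name]["precondition"]
    achieves = any(effects.get(f) == v for f, v in goal.items())
    clobbers = any(f in effects and effects[f] != v for f, v in goal.items())
    if not achieves or clobbers:
        return None
    subgoal = {f: v for f, v in goal.items() if f not in effects}   # goals not achieved by the action
    for f, v in precond.items():                                    # plus the action's preconditions
        if f in subgoal and subgoal[f] != v:
            return None                                             # N would be inconsistent
        subgoal[f] = v
    return subgoal

print(regress({"SWC": False}, "dc"))   # {'RLoc': 'off', 'RHC': True}, i.e. the subgoal [off, rhc]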
Regression example

[Figure: regression search graph rooted at the goal [swc]. The dc arc gives subgoal [off,rhc]; from [off,rhc], mc and mac arcs give subgoals [cs,rhc] and [lab,rhc]; a puc arc from [cs,rhc] gives [cs], and further mc/mac arcs give subgoals such as [mr,rhc] and [off,rhc].]

Actions: mc (move clockwise), mac (move anticlockwise), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 4
Find the errors

[Figure: an exercise. A regression search graph starting from the goal [swc,rhc,mw], with numbered subgoal nodes and arcs labeled pum, dc, puc, mc, dm, ...; some of the arcs and subgoals are incorrect and are to be identified.]

Actions: mc (move clockwise), mac (move anticlockwise), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 5
Loop detection and multiple-path pruning

Goal G1 is simpler than goal G2 if G1 is a subset of G2 .


◮ It is easier to solve [cs] than [cs, rhc].
If you have a path to node N and have already found a path
to a simpler goal, you can prune the path to N.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 6
Improving Efficiency

You can define a heuristic function that estimates how


difficult it is to solve the goal from the initial state.
You can use domain-specific knowledge to remove
impossible goals.
◮ For example, it is often not obvious from an action
description alone that an agent can only hold one item at
any time.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 7
Comparing forward and regression planners

Which is more efficient depends on:


◮ The branching factor
◮ How good the heuristics are
Forward planning is unconstrained by the goal (except as
a source of heuristics).
Regression planning is unconstrained by the initial state
(except as a source of heuristics)

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 8
