
CS 240
Learning (Machine Learning) & Planning

Intro. to ML, Emdad Khan
Machine Learning: Short Introduction
• Learning Agent: An agent that improves its
performance through learning over time (as you have
seen in the first few lectures).

• Learning is very important for most Agents.

  Accordingly, a new field called “Machine Learning” was born.

• What is Machine Learning?


– Basically learning the behavior, features or characteristics
from the data itself, instead of writing algorithms
directly that can model the data.



Machine Learning: Short Introduction

– Machine learning, a branch of artificial intelligence, is a scientific
discipline concerned with the design and development of algorithms
that take as input empirical data, such as that from
sensors or databases, and yield patterns or predictions
thought to be features of the underlying mechanism that
generated the data.



Machine Learning: Short Introduction
• Why Machine Learning?
– In this Information Age, many applications are highly data driven
with large amount of data – NLU, Imaging, Medical Applications,
Mining, Weather, Information Retrieval, Search….

– If Data had a Mass, our earth would be so heavy that it would


become a Black Hole!

– Very much needed when we have a large amount of data that is very
hard (or sometimes impossible) to model by writing
algorithms directly.

– NOTE: We are still writing algorithms, BUT algorithms for HOW to LEARN
from data, instead of observing the data and coming up with algorithms
to model it (or writing algorithms by observing system behavior).



Machine Learning: Short Introduction
• Types of Machine Learning
– Supervised Learning : a training set of examples with correct
response.

– Unsupervised Learning: a training set of examples without any


correct response (inputs having something in common are
categorized together)

– Reinforcement Learning: (between Supervised and Unsupervised) The
algorithm is told when the answer is wrong but is NOT told how
to correct it. Also known as “Learning with a Critic”.

– Evolutionary Learning: (based on biological evolution) Biological
organisms adapt to improve their survival rates and chance of having
offspring in their environment.



Machine Learning: Short Introduction
• Types of Algorithms (High Level)
– Statistical

– Non-statistical (Neural Networks, e.g. Back Propagation: finding
optimal value(s) by solving some differential equations)

– Evolutionary Learning / Genetic Algorithms

The rest of the slides show more details; they are from “Artificial
Intelligence” by David Poole et al., the second textbook.



Learning

Learning is the ability to improve one’s behavior based on


experience.
The range of behaviors is expanded: the agent can do
more.
The accuracy on tasks is improved: the agent can do
things better.
The speed is improved: the agent can do things faster.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 1
Components of a learning problem

The following components are part of any learning problem:


task The behavior or task that’s being improved.
For example: classification, acting in an environment
data The experiences that are being used to improve
performance in the task.
measure of improvement How can the improvement be
measured?
For example: increasing accuracy in prediction, new skills
that were not present initially, improved speed.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 2
Learning task / Learning architecture

[Figure: Experiences/Data and Background knowledge/Bias feed the Learner, which produces Model(s); the Reasoner uses the Model(s) together with a Problem/Task to produce an Answer/Performance.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Pages 3-4
Choosing a representation

The richer the representation, the more useful it is for


subsequent problem solving.
The richer the representation, the more difficult it is to
learn.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 5
Common Learning Tasks

Supervised classification Given a set of pre-classified


training examples, classify a new instance.
Unsupervised learning Find natural classes for examples.
Reinforcement learning Determine what to do based on
rewards and punishments.
Analytic learning Reason faster using experience.
Inductive logic programming Build richer models in
terms of logic programs.
Statistical relational learning learning relational
representations that also deal with uncertainty.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 6
Example Classification Data

Training Examples:
     Action  Author   Thread  Length  Where
e1   skips   known    new     long    home
e2   reads   unknown  new     short   work
e3   skips   unknown  old     long    work
e4   skips   known    old     long    home
e5   reads   known    new     short   home
e6   skips   known    old     long    work
New Examples:
e7   ???     known    new     short   work
e8   ???     unknown  new     short   work

We want to classify new examples on feature Action based on
the examples’ Author, Thread, Length, and Where.
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 7
Feedback

Learning tasks can be characterized by the feedback given to


the learner.
Supervised learning What has to be learned is specified
for each example.
Unsupervised learning No classifications are given; the
learner has to discover categories and regularities in the
data.
Reinforcement learning Feedback occurs after a
sequence of actions.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 8
Measuring Success

The measure of success is not how well the agent


performs on the training examples, but how well the
agent performs for new examples.
Consider two agents:
◮ P claims the negative examples seen are the only
negative examples. Every other instance is positive.
◮ N claims the positive examples seen are the only
positive examples. Every other instance is negative.
Both agents correctly classify every training example, but
disagree on every other example.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 9
Bias

The tendency to prefer one hypothesis over another is


called a bias.
Saying a hypothesis is better than N’s or P’s hypothesis
isn’t something that’s obtained from the data.
To have any inductive process make predictions on unseen
data, you need a bias.
What constitutes a good bias is an empirical question
about which biases work best in practice.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 10
Learning as search

Given a representation and a bias, the problem of learning


can be reduced to one of search.
Learning is search through the space of possible
representations, looking for the representation or
representations that best fit the data, given the bias.
These search spaces are typically prohibitively large for
systematic search. E.g., use gradient descent.
A learning algorithm is made of a search space, an
evaluation function, and a search method.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 11
Noise

Data isn’t perfect:


◮ some of the attributes are assigned the wrong value
◮ the attributes given are inadequate to predict the
classification
◮ there are examples with missing attributes
overfitting occurs when distinctions appear in the
training data, but not in the unseen examples.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 12
Errors in learning

Errors in learning are caused by:


Limited representation (representation bias)
Limited search (search bias)
Limited data (variance)
Limited features (noise)

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 13
Characterizations of Learning

Find the best representation given the data.


Delineate the class of consistent representations given the
data.
Find a probability distribution of the representations given
the data.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.1, Page 14
Supervised Learning

Given:
a set of input features X1 , . . . , Xn
a set of target features Y1 , . . . , Yk
a set of training examples where the values for the input
features and the target features are given for each
example
a new example, where only the values for the input
features are given
predict the values for the target features for the new example.
classification when the Yi are discrete
regression when the Yi are continuous

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 2
Example Data Representations
A travel agent wants to predict the preferred length of a trip,
which can be from 1 to 6 days. (No input features).
Two representations of the same data:
— Y is the length of trip chosen.
— Each Yi is an indicator variable that has value 1 if the
chosen length is i, and is 0 otherwise.
Example  Y        Example  Y1  Y2  Y3  Y4  Y5  Y6
e1       1        e1       1   0   0   0   0   0
e2       6        e2       0   0   0   0   0   1
e3       6        e3       0   0   0   0   0   1
e4       2        e4       0   1   0   0   0   0
e5       1        e5       1   0   0   0   0   0
What is a prediction?
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 3
Evaluating Predictions

Suppose we want to make a prediction of a value for a target


feature on example e:
o_e is the observed value of the target feature on example e.
p_e is the predicted value of the target feature on example e.
The error of the prediction is a measure of how close p_e
is to o_e.
There are many possible errors that could be measured.
Sometimes p_e can be a real number even though o_e can only
have a few values.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 4
Measures of error

E is the set of examples, with a single target feature. For e ∈ E,
o_e is the observed value and p_e is the predicted value:

  absolute error:         L1(E) = Σ_{e∈E} |o_e − p_e|

  sum of squares error:   L2²(E) = Σ_{e∈E} (o_e − p_e)²

  worst-case error:       L∞(E) = max_{e∈E} |o_e − p_e|

  number wrong:           L0(E) = #{e : o_e ≠ p_e}

A cost-based error takes into account costs of errors.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 5-9
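To make the error measures concrete, here is a small Python sketch (not from the slides; the data values are made up for illustration) that computes each measure for a list of observed/predicted pairs:

# Computing the error measures above for observed (o_e) and predicted (p_e) values.
observed  = [0, 1, 1, 0, 1]
predicted = [0.2, 0.8, 0.6, 0.4, 1.0]

abs_error  = sum(abs(o - p) for o, p in zip(observed, predicted))      # L1
sq_error   = sum((o - p) ** 2 for o, p in zip(observed, predicted))    # L2 squared
worst_case = max(abs(o - p) for o, p in zip(observed, predicted))      # L-infinity
num_wrong  = sum(o != round(p) for o, p in zip(observed, predicted))   # L0, thresholding p at 0.5

print(abs_error, sq_error, worst_case, num_wrong)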
Measures of error (cont.)

With a binary feature, o_e ∈ {0, 1}:

  likelihood of the data:  Π_{e∈E} p_e^{o_e} (1 − p_e)^{(1 − o_e)}

  entropy (number of bits to encode the data given a code based on pval):
    − Σ_{e∈E} (o_e log p_e + (1 − o_e) log(1 − p_e))

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 10-11
Information theory overview

A bit is a binary digit.


1 bit can distinguish 2 items
k bits can distinguish 2^k items
n items can be distinguished using log₂ n bits
Can we do better?

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 12
Information and Probability
Let’s design a code to distinguish elements of {a, b, c, d} with

  P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/8

Consider the code:

  a 0   b 10   c 110   d 111

This code sometimes uses 1 bit, sometimes 2 bits, and sometimes 3 bits.
On average, it uses

  P(a) × 1 + P(b) × 2 + P(c) × 3 + P(d) × 3
  = 1/2 + 2/4 + 3/8 + 3/8 = 1 3/4 bits.

The string aacabbda has code 00110010101110.
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 13
Information Content
To identify x, we need − log₂ P(x) bits.
Given a distribution over a set, to identify a member, the
expected number of bits

  − Σ_x P(x) × log₂ P(x)

is the information content or entropy of the distribution.
The expected number of bits it takes to describe a
distribution given evidence e:

  I(e) = − Σ_x P(x|e) × log₂ P(x|e).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 14
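As a quick numerical check (a Python sketch, not from the slides): the entropy of the distribution P(a)=1/2, P(b)=1/4, P(c)=1/8, P(d)=1/8 from the coding example is 1.75 bits, matching the average code length computed above.

import math

def entropy(probs):
    """Information content of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75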
Information Gain

Given a test that can distinguish the cases where α is true


from the cases where α is false, the information gain from
this test is:

I (true) − (P(α) × I (α) + P(¬α) × I (¬α)).

I (true) is the expected number of bits needed before the


test
P(α) × I (α) + P(¬α) × I (¬α) is the expected number of
bits after the test.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 15
Linear Predictions

[Figure: a small data set with fitted lines labeled L∞, L2², and L1, showing that different error measures lead to different “best” linear predictions.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 16-17
Point Estimates
To make a single prediction for feature Y , with examples E :
The prediction that minimizes the sum of squares error on
E is the mean (average) value of Y .
The prediction that minimizes the absolute error on E is
the median value of Y .
The prediction that minimizes the number wrong on E is
the mode of Y .
The prediction that minimizes the worst-case error on E
is (maximum + minimum)/2.
When Y has domain {0, 1}, the prediction that
maximizes the likelihood on E is the empirical probability.
When Y has domain {0, 1}, the prediction that minimizes
the entropy on E is the empirical probability.
But that doesn’t mean that these predictions minimize the
error for future predictions....
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 18-30
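These claims are easy to check numerically. A brute-force Python sketch (not from the slides), using the trip-length data from the earlier representation example:

ys = [1, 6, 6, 2, 1]                       # trip lengths chosen in examples e1..e5
candidates = [i / 100 for i in range(0, 701)]

best_sq    = min(candidates, key=lambda p: sum((y - p) ** 2 for y in ys))
best_abs   = min(candidates, key=lambda p: sum(abs(y - p) for y in ys))
best_worst = min(candidates, key=lambda p: max(abs(y - p) for y in ys))

print(best_sq)     # 3.2, the mean of ys
print(best_abs)    # 2.0, the median of ys
print(best_worst)  # 3.5, (maximum + minimum) / 2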
Training and Test Sets

To evaluate how well a learner will work on future predictions,


we divide the examples into:
training examples that are used to train the learner
test examples that are used to evaluate the learner
...these must be kept separate.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 31
Learning Probabilities
Empirical probabilities do not make good predictors when
evaluated by likelihood or entropy.
Why? A probability of zero means “impossible” and has
infinite cost if there is one true case in the test set.
Solution: (Laplace smoothing) add (non-negative)
pseudo-counts to the data.
Suppose n_i is the number of examples with X = v_i, and
c_i is the pseudo-count:

  P(X = v_i) = (c_i + n_i) / Σ_{i′} (c_{i′} + n_{i′})

Pseudo-counts convey prior knowledge. Consider: “how
much more would I believe v_i if I had seen one example
with v_i true than if I had seen no examples with v_i true?”
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Pages 32-34
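A minimal Python sketch of the pseudo-count estimate (not from the slides), assuming an equal pseudo-count for every value in the domain:

from collections import Counter

def smoothed_probs(values, domain, pseudo_count=1):
    """P(X = v_i) = (c_i + n_i) / sum over i' of (c_i' + n_i')."""
    counts = Counter(values)
    total = sum(counts[v] + pseudo_count for v in domain)
    return {v: (counts[v] + pseudo_count) / total for v in domain}

# "reads" never occurs in the data, but smoothing keeps its probability above zero.
print(smoothed_probs(["skips", "skips", "skips"], domain=["skips", "reads"]))
# {'skips': 0.8, 'reads': 0.2}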
Basic Models for Supervised Learning

Many learning algorithms can be seen as deriving from:


decision trees
linear (and non-linear) classifiers
Bayesian classifiers

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 1
Learning Decision Trees

Representation is a decision tree.


Bias is towards simple decision trees.
Search through the space of decision trees, from simple
decision trees to more complex ones.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 2
Decision trees

A decision tree (for a particular output feature) is a tree


where:
Each nonleaf node is labeled with an input feature.
The arcs out of a node labeled with feature A are labeled
with each possible value of the feature A.
The leaves of the tree are labeled with a point prediction of
the output feature.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 3
Example Decision Trees

[Figure: two decision trees for the Action feature.
Tree 1: split on Length; long → skips; short → split on Thread; new → reads; follow_up → split on Author; unknown → skips, known → reads.
Tree 2: split on Length; long → skips; short → reads with probability 0.82.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 4
Equivalent Logic Program

skips ← long .
reads ← short ∧ new .
reads ← short ∧ follow up ∧ known.
skips ← short ∧ follow up ∧ unknown.

or with negation as failure:

reads ← short ∧ new .


reads ← short ∧ ∼new ∧ known.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 5
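The same tree can also be written directly as nested conditionals; a Python sketch (function and argument names are illustrative) equivalent to the logic program above:

def predict_action(length, thread, author):
    """The first example decision tree, written as nested conditionals."""
    if length == "long":
        return "skips"
    # length is short
    if thread == "new":
        return "reads"
    # thread is a follow_up
    return "reads" if author == "known" else "skips"

print(predict_action("short", "follow_up", "unknown"))  # skips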
Issues in decision-tree learning

Given some training examples, which decision tree should


be generated?
A decision tree can represent any discrete function of the
input features.
You need a bias. For example, prefer the smallest tree.
Least depth? Fewest nodes? Which trees are the best
predictors of unseen data?
How should you go about building a decision tree? The
space of decision trees is too big for systematic search for
the smallest decision tree.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 6
Searching for a Good Decision Tree

The input is a set of input features, a target feature and,


a set of training examples.
Either:
◮ Stop and return a value for the target feature or a
distribution over target feature values
◮ Choose an input feature to split on.
For each value of this feature, build a subtree for those
examples with this value for the input feature.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 7
Choices in implementing the algorithm

When to stop:
◮ no more input features
◮ all examples are classified the same
◮ too few examples to make an informative split
Which feature to select to split on isn’t defined. Often we
use a myopic split: which single split gives the smallest error.
With multi-valued features, we can split on all values or
split the values in half.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 8-10
Example: possible splits

[Figure: the training examples (9 skips, 9 reads) split two ways.
Split on length: long → 7 skips, 0 reads; short → 2 skips, 9 reads.
Split on thread: new → 3 skips, 7 reads; old → 6 skips, 2 reads.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 11
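One common myopic criterion is to choose the split with the lowest expected entropy (equivalently, the highest information gain). A Python sketch (not from the slides) using the (skips, reads) counts in the figure above:

import math

def entropy_of_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_entropy(partitions):
    """Expected entropy after a split, weighted by partition size."""
    n = sum(sum(part) for part in partitions)
    return sum(sum(part) / n * entropy_of_counts(part) for part in partitions)

print(split_entropy([(7, 0), (2, 9)]))   # about 0.42 bits: split on length
print(split_entropy([(3, 7), (6, 2)]))   # about 0.85 bits: split on thread
# Splitting on length gives the lower expected entropy, so it is the myopic choice.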
Handling Overfitting

This algorithm can overfit the data.


This occurs when the tree fits noise and correlations in the training
set that are not reflected in the data as a whole.
To handle overfitting:
◮ restrict the splitting, and split only when the split is
useful.
◮ allow unrestricted splitting and prune the resulting tree
where it makes unwarranted distinctions.
◮ learn multiple trees and average them.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 12
Linear Function

A linear function of features X1 , . . . , Xn is a function of the


form:

f w (X1 , . . . , Xn ) = w0 + w1 X1 + · · · + wn Xn

We invent a new feature X0 , which always has value 1, so that w0
is not a special case.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 13
Linear Regression

Linear regression is where the output is a linear function of
the input features:

  pval_w(e, Y) = w0 + w1 × val(e, X1) + · · · + wn × val(e, Xn)

The sum of squares error on examples E for output Y is:

  Error_E(w) = Σ_{e∈E} (val(e, Y) − pval_w(e, Y))²
             = Σ_{e∈E} (val(e, Y) − (w0 + w1 × val(e, X1) + · · · + wn × val(e, Xn)))²

Goal: find weights that minimize Error_E(w).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 14-15
Finding weights that minimize Error_E(w)

Find the minimum analytically.
Effective when it can be done (e.g., for linear regression).

Find the minimum iteratively.
Works for larger classes of problems.
Gradient descent:

  w_i ← w_i − η × ∂Error_E(w)/∂w_i

η is the gradient descent step size, the learning rate.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 16-17
Gradient Descent for Linear Regression
1: procedure LinearLearner (X , Y , E , η)
2: Inputs
3: X : set of input features, X = {X1 , . . . , Xn }
4: Y : output feature
5: E : set of examples from which to learn
6: η: learning rate
7: initialize w0 , . . . , wn randomly
8: repeat
9: for each example e in E do
10: δ ← val(e, Y ) − pval w (e, Y )
11: for each i ∈ [0, n] do
12: wi ← wi + ηδval(e, Xi )
13: until some stopping criterion is true
14: return w0 , . . . , wn
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 18
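A direct Python translation of the LinearLearner procedure above (a sketch; the fixed epoch count as the stopping criterion and the data format are my own assumptions):

import random

def linear_learner(examples, n_features, eta=0.01, epochs=1000):
    """examples: list of (x, y), where x is a list of n_features input values.
    An implicit feature X0 = 1 is prepended so w0 is handled like any other weight."""
    w = [random.uniform(-0.1, 0.1) for _ in range(n_features + 1)]
    for _ in range(epochs):                          # stopping criterion: fixed number of passes
        for x, y in examples:
            xs = [1.0] + list(x)
            delta = y - sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(len(w)):
                w[i] += eta * delta * xs[i]
    return w

# Learn y = 2x + 1 from four points; the returned weights approach [1.0, 2.0].
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
print(linear_learner(data, n_features=1))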
Linear Classifier

Assume we are doing binary classification, with classes


{0, 1} (e.g., using indicator functions).
There is no point in making a prediction of less than 0 or
greater than 1.
A squashed linear function is of the form:

f w (X1 , . . . , Xn ) = f (w0 + w1 X1 + · · · + wn Xn )

where f is an activation function .


A simple activation function is the step function:

  f (x) = 1 if x ≥ 0, and 0 if x < 0

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 19
Gradient Descent for Linear Classifiers
If the activation function is differentiable, we can use gradient descent
to update the weights. The sum of squares error is:

  Error_E(w) = Σ_{e∈E} (val(e, Y) − f(Σ_i w_i × val(e, X_i)))²

The partial derivative with respect to weight w_i for example e is:

  ∂Error_E(w)/∂w_i = −2 × δ × f′(Σ_i w_i × val(e, X_i)) × val(e, X_i)

where δ = val(e, Y) − pval_w(e, Y).

Thus, each example e updates each weight w_i by

  w_i ← w_i + η × δ × f′(Σ_i w_i × val(e, X_i)) × val(e, X_i)

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 20
The sigmoid or logistic activation function

[Figure: plot of the sigmoid over x ∈ [−10, 10], rising from 0 toward 1, with value 0.5 at x = 0.]

  f (x) = 1 / (1 + e^(−x))

  f ′(x) = f (x)(1 − f (x))

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 21-22
Gradient Descent for Logistic Regression
1: procedure LinearLearner (X , Y , E , η)
2: Inputs
3: X : set of input features, X = {X1 , . . . , Xn }
4: Y : output feature
5: E : set of examples from which to learn
6: η: learning rate
7: initialize w0 , . . . , wn randomly
8: repeat
9: for each example e in E do
10: p ← f (Σ_i w_i × val(e, X_i))
11: δ ← val(e, Y ) − p
12: for each i ∈ [0, n] do
13: wi ← wi + ηδp(1 − p)val(e, Xi )
14: until some stopping criterion is true
15: return w0 , . . . , wn
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 23
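The same translation for the logistic-regression learner above; the only changes are the sigmoid and the extra p(1 − p) factor in the update (again a Python sketch, with an assumed data format and stopping criterion):

import math
import random

def logistic_learner(examples, n_features, eta=0.5, epochs=2000):
    """Stochastic gradient descent for a sigmoid-squashed linear function."""
    w = [random.uniform(-0.1, 0.1) for _ in range(n_features + 1)]
    for _ in range(epochs):
        for x, y in examples:
            xs = [1.0] + list(x)                     # X0 = 1 supplies the bias weight w0
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, xs))))
            delta = y - p
            for i in range(len(w)):
                w[i] += eta * delta * p * (1 - p) * xs[i]   # f'(x) = p(1 - p)
    return w

# Learn a simple threshold: y = 1 when x is above roughly 0.5.
data = [([0.1], 0), ([0.3], 0), ([0.6], 1), ([0.9], 1)]
print(logistic_learner(data, n_features=1))   # typically w0 < 0 < w1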
Simple Example

[Figure: a squashed linear function predicting reads, with weights new → −0.7, short → −0.9, home → 1.2, and bias 0.4.]

Ex   new  short  home  Predicted         Obs  error
e1   0    0      0     f (0.4)  = 0.6    0    0.36
e2   1    1      0     f (−1.2) = 0.23   0    0.053
e3   1    0      1     f (0.9)  = 0.71   1    0.084

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 24-25
Linearly Separable

A classification is linearly separable if there is a


hyperplane where the classification is true on one side of
the hyperplane and false on the other side.
For the sigmoid function, the hyperplane is when:
w0 + w1 × val(e, X1 ) + · · · + wn × val(e, Xn ) = 0.
If the data are linearly separable, the error can be made
arbitrarily small.
[Figure: the functions or, and, and xor on inputs in {0, 1}²: or is positive everywhere except (0, 0); and is positive only at (1, 1); xor is positive at (0, 1) and (1, 0) only.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 26
Bayesian classifiers
Idea: if you knew the classification you could predict the
values of features.

  P(Class|X1 . . . Xn ) ∝ P(X1 , . . . , Xn |Class)P(Class)

Naive Bayesian classifier: the Xi are independent of each
other given the class.
Requires: P(Class) and P(Xi |Class) for each Xi .

  P(Class|X1 . . . Xn ) ∝ P(Class) × Π_i P(Xi |Class)

[Figure: naive Bayes belief network with class UserAction and child features Author, Thread, Length, Where Read.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 27
Learning Probabilities

[Figure: a table of counts is converted into a naive Bayes network with class C and children X1, X2, X3, X4.]

X1   X2   X3   X4   C   Count
...  ...  ...  ...  ..  ...
t    f    t    t    1   40
t    f    t    t    2   10
t    f    t    t    3   50
...  ...  ...  ...  ..  ...

  P(C = v_i) = Σ_{t |= C=v_i} Count(t) / Σ_t Count(t)

  P(X_k = v_j | C = v_i) = Σ_{t |= C=v_i ∧ X_k=v_j} Count(t) / Σ_{t |= C=v_i} Count(t)

...perhaps including pseudo-counts
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Pages 28-29
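A minimal naive Bayes classifier built from such counts (a Python sketch, not the book's code; the toy data and the Laplace smoothing with pseudo-count 1 are my own choices):

from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features_dict, cls). Gathers the counts used above."""
    class_counts = Counter(cls for _, cls in examples)
    feature_counts = defaultdict(Counter)        # (feature, cls) -> Counter over values
    domains = defaultdict(set)                   # feature -> set of observed values
    for feats, cls in examples:
        for f, v in feats.items():
            feature_counts[(f, cls)][v] += 1
            domains[f].add(v)
    return class_counts, feature_counts, domains

def predict(model, feats, pseudo=1):
    class_counts, feature_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total                    # P(Class)
        for f, v in feats.items():               # times P(X_i | Class), with pseudo-counts
            score *= (feature_counts[(f, cls)][v] + pseudo) / (n_cls + pseudo * len(domains[f]))
        scores[cls] = score
    return max(scores, key=scores.get)

examples = [({"author": "known", "length": "long"}, "skips"),
            ({"author": "unknown", "length": "short"}, "reads"),
            ({"author": "known", "length": "short"}, "reads")]
print(predict(train_naive_bayes(examples), {"author": "known", "length": "short"}))  # reads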
Help System

[Figure: naive Bayes network with class H (the help page) and a boolean child node for each word: "able", "absent", "add", ..., "zoom".]

The domain of H is the set of all help pages.
The observations are the words in the query.
What probabilities are needed?
What pseudo-counts and counts are used?
What data can be used to learn from?
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.3, Page 30
Planning

Planning is deciding what to do based on an agent’s


ability, its goals, and the state of the world.
Planning is finding a sequence of actions to solve a goal.
Initial assumptions:
◮ The world is deterministic.
◮ There are no exogenous events outside of the control of
the robot that change the state of the world.
◮ The agent knows what state it is in.
◮ Time progresses discretely from one state to the next.
◮ Goals are predicates of states that need to be achieved
or maintained.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 1
Actions

A deterministic action is a partial function from states to


states.
The preconditions of an action specify when the action
can be carried out.
The effect of an action specifies the resulting state.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 2
Delivery Robot Example

[Figure: map of four connected locations: Coffee Shop, Sam’s Office, Mail Room, and Lab.]

Features:
  RLoc – Rob’s location
  RHC – Rob has coffee
  SWC – Sam wants coffee
  MW – Mail is waiting
  RHM – Rob has mail

Actions:
  mc – move clockwise
  mcc – move counterclockwise
  puc – pickup coffee
  dc – deliver coffee
  pum – pickup mail
  dm – deliver mail
c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 3
Explicit State-space Representation

State                        Action   Resulting State
〈lab, rhc, swc, mw, rhm〉     mc       〈mr, rhc, swc, mw, rhm〉
〈lab, rhc, swc, mw, rhm〉     mcc      〈off, rhc, swc, mw, rhm〉
〈off, rhc, swc, mw, rhm〉     dm       〈off, rhc, swc, mw, rhm〉
〈off, rhc, swc, mw, rhm〉     mcc      〈cs, rhc, swc, mw, rhm〉
〈off, rhc, swc, mw, rhm〉     mc       〈lab, rhc, swc, mw, rhm〉
...                          ...      ...

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 4
Feature-based representation of actions

For each action:


precondition is a proposition that specifies when the
action can be carried out.
For each feature:
causal rules that specify when the feature gets a new
value and
frame rules that specify when the feature keeps its value.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 5
Example feature-based representation
Precondition of pick-up coffee (puc):

  RLoc = cs ∧ ¬rhc

Rules for location is cs:

  RLoc′ = cs ← RLoc = off ∧ Act = mcc
  RLoc′ = cs ← RLoc = mr ∧ Act = mc
  RLoc′ = cs ← RLoc = cs ∧ Act ≠ mcc ∧ Act ≠ mc

Rules for “robot has coffee”:

  rhc′ ← rhc ∧ Act ≠ dc
  rhc′ ← Act = puc

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 6
STRIPS Representation

For each action:


precondition that specifies when the action can be
carried out.
effect a set of assignments of values to features that are
made true by this action.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 7
Example STRIPS representation

Pick-up coffee (puc):
  precondition: [cs, ¬rhc]
  effect: [rhc]
Deliver coffee (dc):
  precondition: [off , rhc]
  effect: [¬rhc, ¬swc]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.1, Page 8
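One simple way to hold STRIPS-style descriptions in code is a dictionary per action, with preconditions and effects as partial assignments of features to values; a Python sketch (the data structure and names are illustrative, not the book's):

actions = {
    "puc": {"precondition": {"RLoc": "cs", "RHC": False},
            "effect":       {"RHC": True}},
    "dc":  {"precondition": {"RLoc": "off", "RHC": True},
            "effect":       {"RHC": False, "SWC": False}},
}

def applicable(state, name):
    """An action can be carried out when its precondition holds in the state."""
    return all(state.get(f) == v for f, v in actions[name]["precondition"].items())

def apply(state, name):
    """The resulting state: the effect's assignments override the old values."""
    new_state = dict(state)
    new_state.update(actions[name]["effect"])
    return new_state

state = {"RLoc": "cs", "RHC": False, "SWC": True, "MW": True, "RHM": False}
if applicable(state, "puc"):
    print(apply(state, "puc"))   # RHC becomes True; everything else is unchanged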
Planning

Given:
A description of the effects and preconditions of the
actions
A description of the initial state
A goal to achieve
find a sequence of actions that is possible and will result in a
state satisfying the goal.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 1
Forward Planning

Idea: search in the state-space graph.


The nodes represent the states
The arcs correspond to the actions: The arcs from a state
s represent all of the actions that are legal in state s.
A plan is a path from the state representing the initial
state to a state that satisfies the goal.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 2
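A forward planner is then an ordinary graph search over states. A breadth-first Python sketch (not from the book), reusing the actions, applicable, and apply helpers from the STRIPS sketch above; with only puc and dc defined there, the example below needs just one step:

from collections import deque

def forward_plan(initial_state, goal):
    """Breadth-first forward search; returns a list of action names achieving the goal."""
    frontier = deque([(initial_state, [])])
    seen = {tuple(sorted(initial_state.items()))}
    while frontier:
        state, plan = frontier.popleft()
        if all(state.get(f) == v for f, v in goal.items()):
            return plan
        for name in actions:
            if applicable(state, name):
                nxt = apply(state, name)
                key = tuple(sorted(nxt.items()))
                if key not in seen:              # cycle check / multiple-path pruning
                    seen.add(key)
                    frontier.append((nxt, plan + [name]))
    return None

print(forward_plan({"RLoc": "off", "RHC": True, "SWC": True}, {"SWC": False}))  # ['dc']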
Example state-space graph

Actions: mc (move clockwise), mac (move anticlockwise), nm (no move), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

[Figure: part of the forward-planning state-space graph. Nodes are complete states such as 〈cs,rhc,swc,mw,rhm〉; arcs are the actions legal in each state (mc, mac, puc, dc, ...) leading to successor states.]

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 3
What are the errors?

[Figure: an exercise. A forward-planning state-space graph with numbered nodes (states such as 〈mr,rhc,swc,mw,rhm〉) and action arcs (mc, mac, puc, pum, ...); some of the transitions drawn are incorrect and are to be identified.]

Actions: mc (move clockwise), mac (move anticlockwise), nm (no move), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 4
Forward planning representation

The search graph can be constructed on demand: you


only construct reachable states.
If you want a cycle check or multiple path-pruning, you
need to be able to find repeated states.
There are a number of ways to represent states:
◮ As a specification of the value of every feature
◮ As a path from the start state

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 5
Improving Search Efficiency

Forward search can use domain-specific knowledge specified


as:
a heuristic function that estimates the number of steps to
the goal
domain-specific pruning of neighbors:
◮ don’t go to the coffee shop unless “Sam wants coffee” is
part of the goal and Rob doesn’t have coffee
◮ don’t pick-up coffee unless Sam wants coffee
◮ unless the goal involves time constraints, don’t do the
“no move” action.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.2, Page 6
Regression Planning

Idea: search backwards from the goal description: nodes


correspond to subgoals, and arcs to actions.
Nodes are propositions: a formula made up of
assignments of values to features
Arcs correspond to actions that can achieve one of the
goals
Neighbors of a node N associated with arc A specify what
must be true immediately before A so that N is true
immediately after.
The start node is the goal to be achieved.
goal(N) is true if N is a proposition that is true of the
initial state.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 1
Defining nodes and arcs

A node N can be represented as a set of assignments of


values to variables:

[X1 = v1 , . . . , Xn = vn ]

This is a set of assignments you want to hold.


The last action is one that achieves one of the Xi = vi ,
and does not achieve Xj = vj′ where vj′ is different from vj .
The neighbor of N along arc A must contain:
◮ The prerequisites of action A
◮ All of the elements of N that were not achieved by A
N must be consistent.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 2
Formalizing arcs using STRIPS notation

  ⟨G , A, N⟩

where G is [X1 = v1 , . . . , Xn = vn ], is an arc if:

  ∃i Xi = vi is on the effects list of action A
  ∀j Xj = vj′ is not on the effects list for A, where vj′ ≠ vj
  N is preconditions(A) ∪ {Xk = vk : Xk = vk ∉ effects(A)}
  and N is consistent in that it does not assign different
  values to any variable.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 3
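A sketch of the corresponding neighbor computation for a regression planner (again using the value-style actions dictionary from the STRIPS sketch; names are illustrative):

def regress(goal, name):
    """Subgoal that must hold before action 'name' so that 'goal' holds after it,
    or None if the action cannot be the last action for this goal."""
    effects = actions[name]["effect"]
    precond = actions[name]["precondition"]
    achieves = any(effects.get(f) == v for f, v in goal.items())
    clobbers = any(f in effects and effects[f] != v for f, v in goal.items())
    if not achieves or clobbers:
        return None
    subgoal = {f: v for f, v in goal.items() if f not in effects}   # goals not achieved by the action
    for f, v in precond.items():                                    # plus the action's preconditions
        if f in subgoal and subgoal[f] != v:
            return None                                             # N would be inconsistent
        subgoal[f] = v
    return subgoal

print(regress({"SWC": False}, "dc"))   # {'RLoc': 'off', 'RHC': True}, i.e. the subgoal [off, rhc]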
Regression example

[Figure: regression search graph rooted at the goal [swc]. The dc arc gives subgoal [off,rhc]; from [off,rhc], mc and mac arcs give subgoals [cs,rhc] and [lab,rhc]; a puc arc from [cs,rhc] gives [cs], and further mc/mac arcs give subgoals such as [mr,rhc] and [off,rhc].]

Actions: mc (move clockwise), mac (move anticlockwise), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 4
Find the errors

[Figure: an exercise. A regression search graph starting from the goal [swc,rhc,mw], with numbered subgoal nodes and arcs labeled pum, dc, puc, mc, dm, ...; some of the arcs and subgoals are incorrect and are to be identified.]

Actions: mc (move clockwise), mac (move anticlockwise), puc (pick up coffee), dc (deliver coffee), pum (pick up mail), dm (deliver mail).
Locations: cs (coffee shop), off (office), lab (laboratory), mr (mail room).
Feature values: rhc (robot has coffee), swc (Sam wants coffee), mw (mail waiting), rhm (robot has mail).

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 5
Loop detection and multiple-path pruning

Goal G1 is simpler than goal G2 if G1 is a subset of G2 .


◮ It is easier to solve [cs] than [cs, rhc].
If you have a path to node N and have already found a path
to a simpler goal, you can prune the path to N.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 6
Improving Efficiency

You can define a heuristic function that estimates how


difficult it is to solve the goal from the initial state.
You can use domain-specific knowledge to remove
impossible goals.
◮ For example, it is often not obvious from an action
description alone that an agent can only hold one item at
any time.

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 7
Comparing forward and regression planners

Which is more efficient depends on:


◮ The branching factor
◮ How good the heuristics are
Forward planning is unconstrained by the goal (except as
a source of heuristics).
Regression planning is unconstrained by the initial state
(except as a source of heuristics)

c
D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 8.3, Page 8
