Learning from Observations
Unit 8
www.StudentRockStars.com
Outline
♦ Learning agents
♦ Inductive learning
♦ Decision tree learning
♦ Measuring learning performance
Learning
Learning is essential for unknown environments,
i.e., when designer lacks omniscience
Learning is useful as a system construction method,
i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent’s decision mechanisms to improve performance
Learning agents
[Figure: a general learning agent. The Critic, applying a fixed Performance standard, judges Sensor input and sends feedback to the Learning element; the Learning element makes changes to the Performance element and receives knowledge from it; learning goals drive a Problem generator that suggests exploratory experiments; the Performance element maps percepts to actions through the Effectors, acting on the Environment.]
Learning element
Design of learning element is dictated by
♦ what type of performance element is used
♦ which functional component is to be learned
♦ how that functional component is represented
♦ what kind of feedback is available

Example scenarios:

Performance element   Component          Representation             Feedback
Alpha-beta search     Eval. fn.          Weighted linear function   Win/loss
Logical agent         Transition model   Successor-state axioms     Outcome
Inductive learning

f is the target function

An example is a pair (x, f (x)), e.g., a tic-tac-toe board position x with value f (x) = +1

Problem: find a hypothesis h
such that h ≈ f
given a training set of examples
E.g., curve fitting:

[Figure sequence: the same data points for f(x) fitted by hypotheses of increasing complexity, from a straight line to a high-degree polynomial; later fits are consistent with more of the data but are also more complex hypotheses.]
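The trade-off the figure sequence illustrates can be sketched numerically (the five data points and the fitting methods below are invented for illustration): a degree-4 polynomial passes through all five points exactly, while a least-squares line does not, yet the line is the far simpler hypothesis.

```python
# Consistency vs. simplicity: exact interpolation vs. a least-squares line.
# Data points are invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0]

def lagrange(xs, ys, x):
    """Degree-(n-1) polynomial through all n points (exact interpolation)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def least_squares_line(xs, ys):
    """Closed-form straight-line fit; returns (slope, intercept)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
            / sum((x - xbar) ** 2 for x in xs)
    return slope, ybar - slope * xbar

slope, intercept = least_squares_line(xs, ys)
interp_error = sum((lagrange(xs, ys, x) - y) ** 2 for x, y in zip(xs, ys))
line_error = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
print(interp_error)  # 0.0: the complex hypothesis fits every training point
print(line_error)    # positive: the simple hypothesis misfits the training data
```

The interpolant is perfectly consistent with the data; whether it is the better hypothesis is exactly the question the slides return to.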
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won’t wait for a table:
Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some  $$$    F     T    French   0–10   T
X2       T    F    F    T    Full  $      F     F    Thai     30–60  F
X3       F    T    F    F    Some  $      F     F    Burger   0–10   T
X4       T    F    T    T    Full  $      F     F    Thai     10–30  T
X5       T    F    T    F    Full  $$$    F     T    French   >60    F
X6       F    T    F    T    Some  $$     T     T    Italian  0–10   T
X7       F    T    F    F    None  $      T     F    Burger   0–10   F

(WillWait is the target; the full set has 12 examples, X1–X12)
Decision trees
One possible representation for hypotheses
E.g., here is the “true” tree for deciding whether to wait:
Patrons?
  None  → F
  Some  → T
  Full  → WaitEstimate?
            >60   → F
            30–60 → Alternate?
            10–30 → Hungry?
            0–10  → T

(below these, the tree tests Bar? and Raining?, with No/Yes branches ending in F/T leaves)
Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:
A  B  A xor B
F  F  F
F  T  T
T  F  T
T  T  F

[Tree: root tests A; each branch then tests B; the four leaves reproduce the truth-table entries F, T, T, F.]

Trivially, there is a consistent decision tree for any training set
w/ one path to leaf for each example (unless f nondeterministic in x)
Hypothesis spaces
How many distinct decision trees with n Boolean attributes??

= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)??

Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
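Both counts are easy to check directly:

```python
# Number of Boolean functions of n attributes = number of truth tables
# with 2**n rows, each row labeled T or F: 2**(2**n).
n = 6
print(2 ** (2 ** n))  # 18446744073709551616

# Purely conjunctive hypotheses: each attribute is in (positive),
# in (negative), or out, so 3**n of them.
print(3 ** n)  # 729
```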
Decision tree learning

Idea: (recursively) choose “most significant” attribute as root of (sub)tree

function DTL(examples, attributes, default) returns a decision tree
   if examples is empty then return default
   else if all examples have the same classification then return the classification
   else if attributes is empty then return Mode(examples)
   else
       best ← Choose-Attribute(attributes, examples)
       tree ← a new decision tree with root test best
       for each value vi of best do
           examplesi ← {elements of examples with best = vi}
           subtree ← DTL(examplesi, attributes − best, Mode(examples))
           add a branch to tree with label vi and subtree subtree
       return tree
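A runnable sketch of DTL in Python (the dictionary-based example format, the information-gain version of Choose-Attribute, and the XOR test data are illustrative choices, not part of the pseudocode):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["label"] for e in examples).values()
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)

def choose_attribute(attributes, examples):
    # Information gain: pick the attribute with the smallest remainder.
    def remainder(a):
        values = {e[a] for e in examples}
        return sum(len(sub) / len(examples) * entropy(sub)
                   for v in values
                   for sub in [[e for e in examples if e[a] == v]])
    return min(attributes, key=remainder)

def mode(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:
        return default
    labels = {e["label"] for e in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    branches = {}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        branches[v] = dtl(subset, [a for a in attributes if a != best],
                          mode(examples))
    return (best, branches)  # internal node: (attribute, {value: subtree})

def classify(tree, example):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]
    return tree

# A consistent tree exists for any deterministic training set, e.g. XOR:
data = [{"A": a, "B": b, "label": a != b}
        for a in (False, True) for b in (False, True)]
tree = dtl(data, ["A", "B"], None)
print(all(classify(tree, e) == e["label"] for e in data))  # True
```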
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”
[Figure: splitting the examples by Patrons? (None / Some / Full) gives nearly pure subsets, whereas splitting by Type? (French / Italian / Thai / Burger) leaves each subset half positive, half negative.]
Information
Information answers questions
The more clueless I am about the answer initially, the more information is
contained in the answer

Scale: 1 bit = answer to Boolean question with prior ⟨0.5, 0.5⟩

Information in an answer when prior is ⟨P1, . . . , Pn⟩ is

H(⟨P1, . . . , Pn⟩) = Σ_{i=1}^{n} −Pi log2 Pi

(also called entropy of the prior)
Information contd.
Suppose we have p positive and n negative examples at the root
⇒ H(⟨p/(p+n), n/(p+n)⟩) bits needed to classify a new example

E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit

An attribute splits the examples E into subsets Ei, each of which (we hope)
needs less information to complete the classification

Let Ei have pi positive and ni negative examples
⇒ H(⟨pi/(pi+ni), ni/(pi+ni)⟩) bits needed to classify a new example
⇒ expected number of bits per example over all branches is

Σi ((pi + ni)/(p + n)) H(⟨pi/(pi + ni), ni/(pi + ni)⟩)

For Patrons?, this is 0.459 bits, for Type? this is (still) 1 bit
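Both numbers can be reproduced directly. The per-branch positive/negative counts below assume the standard 12-example restaurant set (Patrons?: None 0+/2−, Some 4+/0−, Full 2+/4−; Type?: 1+/1−, 1+/1−, 2+/2−, 2+/2−), of which only X1–X7 appear in the table above.

```python
import math

def H(ps):
    """Entropy of a discrete prior, in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def remainder(splits, total):
    """Expected bits per example after a split: sum of (pi+ni)/(p+n) * H(...)."""
    return sum((p + n) / total * H([p / (p + n), n / (p + n)])
               for p, n in splits)

patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger
print(round(remainder(patrons, 12), 3))  # 0.459
print(round(remainder(type_, 12), 3))    # 1.0
```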
Example contd.
Decision tree learned from the 12 examples:
Patrons?
  None → F
  Some → T
  Full → Hungry?
           Yes → Type?
                   French → T
                   Italian → F
                   Thai → Fri/Sat?
                            No → F
                            Yes → T
                   Burger → T
           No → F
Substantially simpler than “true” tree—a more complex hypothesis isn’t justified by a small amount of data
Performance measurement
How do we know that h ≈ f ? (Hume’s Problem of Induction)
1) Use theorems of computational/statistical learning theory
2) Try h on a new test set of examples
(use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size

[Figure: learning curve on the restaurant data; % correct on the test set rises from about 0.4 toward 1 as training set size grows from 0 to 100.]

Chapter 18, Sections 1–3
Performance measurement contd.
Learning curve depends on
– realizable (can express target function) vs. non-realizable
  non-realizability can be due to missing attributes
  or restricted hypothesis class (e.g., thresholded linear function)
– redundant expressiveness (e.g., loads of irrelevant attributes)

[Figure: % correct vs. # of examples; the realizable curve approaches 1, the redundant curve rises more slowly, and the nonrealizable curve plateaus below 1.]
Summary
Learning needed for unknown environments, lazy designers
Learning agent = performance element + learning element

Learning method depends on type of performance element, available
feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis
that is approximately consistent with training examples

Decision tree learning using information gain
Statistical learning
Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
– ML parameter learning with complete data
– linear regression
Full Bayesian learning

H is the hypothesis variable, values h1, h2, . . ., prior P(H)

jth observation dj gives the outcome of random variable Dj
training data d = d1, . . . , dN

Given the data so far, each hypothesis has a posterior probability:

P (hi|d) = αP (d|hi)P (hi)

where P (d|hi) is called the likelihood
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

What kind of bag is it? What flavour will the next candy be?
Posterior probability of hypotheses

[Figure: P(hi | d) vs. number of samples in d (0–10, all limes): P(h3 | d) starts highest at 0.4 but falls; P(h5 | d) starts at 0.1 and rises toward 1; P(h1 | d) drops to 0 after the first lime.]
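The curves can be reproduced with a few lines of Bayesian updating (a sketch; it assumes, as in the figure, that every observed candy is lime):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi)

def posterior(num_limes):
    """P(hi | d) after observing num_limes lime candies in a row."""
    unnorm = [pr * pl ** num_limes for pr, pl in zip(priors, p_lime)]
    alpha = 1 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(num_limes):
    """P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)."""
    return sum(pl * po for pl, po in zip(p_lime, posterior(num_limes)))

print(round(predict_lime(0), 3))   # 0.5 with no data
print(round(posterior(10)[4], 2))  # 0.9: h5 dominates after 10 limes
print(round(predict_lime(10), 2))  # 0.97: prediction approaches 1
```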
Prediction probability
[Figure: P(next candy is lime | d) vs. number of samples in d; the prediction starts at 0.5 with no data and rises toward 1 as more lime candies are observed.]
MAP approximation
Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)

I.e., maximize P (d|hi)P (hi) or log P (d|hi) + log P (hi)

Log terms can be viewed as (negative of)
bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi)
I.e., simply get the best fit to the data; identical to MAP for uniform prior
(which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method
ML parameter learning in Bayes nets

[Bayes net: a single node Flavor, with parameter P (F = cherry) = θ]

Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so

P (d|hθ) = Π_{j=1}^{N} P (dj|hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ—which is easier for the log-likelihood:

L(d|hθ) = log P (d|hθ) = Σ_{j=1}^{N} log P (dj|hθ) = c log θ + ℓ log(1 − θ)

dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N
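The closed-form answer θ = c/N can be checked numerically (the counts c = 7, ℓ = 3 below are invented for illustration):

```python
import math

def log_likelihood(theta, c, l):
    # L(d | h_theta) = c log(theta) + l log(1 - theta)
    return c * math.log(theta) + l * math.log(1 - theta)

c, l = 7, 3  # hypothetical counts: 7 cherries, 3 limes
# Grid search over theta in (0, 1); the log-likelihood is concave,
# so the grid maximum sits at the analytic optimum c/N = 0.7.
best = max((i / 1000 for i in range(1, 1000)),
           key=lambda t: log_likelihood(t, c, l))
print(best)         # 0.7
print(c / (c + l))  # closed form: 0.7
```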
Multiple parameters
Red/green wrapper depends probabilistically on flavor:

[Bayes net: Flavor, with P (F = cherry) = θ, is parent of Wrapper, with P (W = red | F = cherry) = θ1 and P (W = red | F = lime) = θ2]

Likelihood for, e.g., cherry candy in green wrapper:

P (F = cherry, W = green|hθ,θ1,θ2)
   = P (F = cherry|hθ,θ1,θ2)P (W = green|F = cherry, hθ,θ1,θ2)
   = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:

P (d|hθ,θ1,θ2) = θ^c(1 − θ)^ℓ · θ1^{rc}(1 − θ1)^{gc} · θ2^{rℓ}(1 − θ2)^{gℓ}
Multiple parameters contd.

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ)

∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒   θ1 = rc/(rc + gc)

∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒   θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately
Linear regression

[Figure: left, the density P (y|x) of a linear Gaussian model over x, y ∈ [0, 1]; right, the corresponding best-fit straight line through sample data.]

Maximizing P (y|x) = (1/(√(2π)σ)) e^{−(y−(θ1x+θ2))²/(2σ²)} w.r.t. θ1, θ2

= minimizing E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution
for a linear fit assuming Gaussian noise of fixed variance
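A minimal check of this equivalence via the closed-form least-squares fit (the data points are invented; they are roughly y = x plus noise):

```python
# ML fit under Gaussian noise = ordinary least squares (closed form).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.9, 4.1]  # roughly y = x plus noise

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
theta1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)  # slope
theta2 = ybar - theta1 * xbar                # intercept

sse = sum((y - (theta1 * x + theta2)) ** 2 for x, y in zip(xs, ys))
# Any other line has a larger sum of squared errors, e.g. slope 1.1:
assert sse <= sum((y - 1.1 * x) ** 2 for x, y in zip(xs, ys))
print(round(theta1, 3), round(theta2, 3))  # 1.0 0.04
```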
Summary
Full Bayesian learning gives best possible predictions but is intractable
MAP learning balances complexity with accuracy on training data
Maximum likelihood assumes uniform prior, OK for large data sets

1. Choose a parameterized family of models to describe the data
   requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
   may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
Neural networks
Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time
Signals are noisy “spike trains” of electrical potential
[Figure: a neuron; cell body with Nucleus, Dendrites, an Axon with axonal arborization, and Synapses connecting to axons from other cells.]
McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
ai ← g(ini) = g(Σj Wj,i aj)

[Figure: unit i; input links carry activations aj weighted by Wj,i, plus a fixed bias input a0 = −1 with bias weight W0,i; the input function Σ computes ini, the activation function g produces the output ai = g(ini), which is sent on the output links.]
Activation functions
[Figure: (a) a step function g(ini), jumping from 0 to +1 at a threshold; (b) a sigmoid g(ini) = 1/(1 + e^{−ini}), rising smoothly from 0 to +1.]
Implementing logical functions

[Figure: units computing AND (W1 = 1, W2 = 1), OR (W1 = 1, W2 = 1), and NOT (W1 = −1), with suitable bias weights.]

McCulloch and Pitts: every Boolean function can be implemented
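With a step activation and the bias convention a0 = −1 from the unit diagram, suitable bias weights complete the three gates. The slide shows only W1 and W2; the bias weights 1.5, 0.5 and −0.5 below are my own (standard) choices:

```python
def step_unit(weights, inputs):
    # Unit output: g(in) with g = step, in = sum_j W_j a_j, where a_0 = -1 (bias).
    total = sum(w * a for w, a in zip(weights, [-1] + inputs))
    return 1 if total > 0 else 0

# weights = [W0 (bias), W1, W2]; W0 values here are illustrative choices.
AND = lambda x1, x2: step_unit([1.5, 1, 1], [x1, x2])
OR  = lambda x1, x2: step_unit([0.5, 1, 1], [x1, x2])
NOT = lambda x1:     step_unit([-0.5, -1], [x1])

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                     # [1, 0]
```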
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons

Feed-forward networks implement functions, have no internal state

Recurrent networks:
– Hopfield networks have symmetric weights (Wi,j = Wj,i)
  g(x) = sign(x), ai = ±1; holographic associative memory
– Boltzmann machines use stochastic activation functions,
  ≈ MCMC in Bayes nets
Feed-forward example
[Figure: input units 1 and 2 feed hidden units 3 and 4 via weights W1,3, W1,4, W2,3, W2,4; units 3 and 4 feed output unit 5 via W3,5 and W4,5.]

Feed-forward network = a parameterized family of nonlinear functions:
a5 = g(W3,5 · a3 + W4,5 · a4) = g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))
Single-layer perceptrons
[Figure: input units connected directly to output units by weights Wj,i; the output of a sigmoid perceptron is a soft threshold surface over the input space (x1, x2).]
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:

Σj Wj xj > 0   or   W · x > 0

[Figure: in (x1, x2) space, AND and OR are linearly separable; the positive and negative examples of XOR cannot be separated by any single line.]
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is

E = (1/2) Err² ≡ (1/2)(y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σ_{j=0}^{n} Wj xj))
       = −Err × g′(in) × xj

Simple weight update rule:

Wj ← Wj + α × Err × g′(in) × xj
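Applied to a sigmoid unit, this rule learns any linearly separable function. A sketch training it on OR (the learning rate, epoch count, and random seed are illustrative choices):

```python
import math, random

def g(x):
    return 1 / (1 + math.exp(-x))  # sigmoid; g'(in) = g(in)(1 - g(in))

# OR examples; first input is the fixed bias a0 = -1.
examples = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]

random.seed(0)
w = [random.uniform(-0.5, 0.5) for _ in range(3)]
alpha = 0.5
for epoch in range(5000):
    for x, y in examples:
        a = g(sum(wj * xj for wj, xj in zip(w, x)))
        err = y - a
        # Wj <- Wj + alpha * Err * g'(in) * xj
        w = [wj + alpha * err * a * (1 - a) * xj for wj, xj in zip(w, x)]

preds = [g(sum(wj * xj for wj, xj in zip(w, x))) for x, _ in examples]
print([round(p) for p in preds])  # [0, 1, 1, 1]
```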
Perceptron learning contd.

[Figure: proportion correct on test set vs. training set size (0–100). Left, MAJORITY on 11 inputs: the perceptron learns faster than the decision tree. Right, RESTAURANT data: the decision tree outperforms the perceptron.]
Multilayer perceptrons
Layers are usually fully connected;
numbers of hidden units typically chosen by hand
[Figure: a feed-forward network with input units ak, a layer of hidden units aj connected to them by weights Wk,j, and output units ai connected to the hidden layer by weights Wj,i.]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: surface plots of hW(x1, x2); combining two opposite-facing soft thresholds makes a ridge, and combining two ridges makes a bump.]
Back-propagation learning
Output layer: same as for single-layer perceptron,
Wj,i ← Wj,i + α × aj × ∆i

where ∆i = Err_i × g′(in_i)

Hidden layer: back-propagate the error from the output layer:

∆j = g′(in_j) Σi Wj,i ∆i

Update rule for weights in hidden layer:

Wk,j ← Wk,j + α × ak × ∆j
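The ∆ rules can be checked against numerical differentiation on a tiny 2–2–1 network (the weights, input, and target below are arbitrary illustrative values):

```python
import math

def g(x):
    return 1 / (1 + math.exp(-x))

x, y = [0.3, -0.2], 0.9         # one training example (illustrative)
W1 = [[0.1, 0.4], [-0.3, 0.2]]  # W1[j][k]: input k -> hidden j
W2 = [0.5, -0.1]                # W2[j]: hidden j -> output

def error(W1, W2):
    a_h = [g(sum(W1[j][k] * x[k] for k in range(2))) for j in range(2)]
    a_o = g(sum(W2[j] * a_h[j] for j in range(2)))
    return 0.5 * (y - a_o) ** 2

# Back-propagation, exactly as in the derivation (one output node):
in_h = [sum(W1[j][k] * x[k] for k in range(2)) for j in range(2)]
a_h = [g(v) for v in in_h]
a_o = g(sum(W2[j] * a_h[j] for j in range(2)))
delta_o = (y - a_o) * a_o * (1 - a_o)              # Delta_i = Err g'(in_i)
delta_h = [a_h[j] * (1 - a_h[j]) * W2[j] * delta_o  # Delta_j
           for j in range(2)]

eps = 1e-6
for j in range(2):
    # dE/dW2[j] = -a_h[j] * Delta_i; compare with a central difference.
    W2p, W2m = W2[:], W2[:]
    W2p[j] += eps; W2m[j] -= eps
    numeric = (error(W1, W2p) - error(W1, W2m)) / (2 * eps)
    print(abs(numeric - (-a_h[j] * delta_o)) < 1e-8)  # True

for j in range(2):
    for k in range(2):
        # dE/dW1[j][k] = -x[k] * Delta_j
        W1p = [row[:] for row in W1]; W1p[j][k] += eps
        W1m = [row[:] for row in W1]; W1m[j][k] -= eps
        numeric = (error(W1p, W2) - error(W1m, W2)) / (2 * eps)
        print(abs(numeric - (-x[k] * delta_h[j])) < 1e-8)  # True
```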
Back-propagation derivation
The squared error on a single example is defined as

E = (1/2) Σi (yi − ai)²

where the sum is over the nodes in the output layer.

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i = −(yi − ai) ∂g(in_i)/∂Wj,i
         = −(yi − ai) g′(in_i) ∂in_i/∂Wj,i = −(yi − ai) g′(in_i) ∂/∂Wj,i (Σj Wj,i aj)
         = −(yi − ai) g′(in_i) aj = −aj ∆i
Back-propagation derivation contd.

∂E/∂Wk,j = −Σi (yi − ai) ∂ai/∂Wk,j = −Σi (yi − ai) ∂g(in_i)/∂Wk,j
         = −Σi (yi − ai) g′(in_i) ∂in_i/∂Wk,j = −Σi ∆i ∂/∂Wk,j (Σj Wj,i aj)
         = −Σi ∆i Wj,i ∂aj/∂Wk,j = −Σi ∆i Wj,i ∂g(in_j)/∂Wk,j
         = −Σi ∆i Wj,i g′(in_j) ∂in_j/∂Wk,j
         = −Σi ∆i Wj,i g′(in_j) ∂/∂Wk,j (Σk Wk,j ak)
         = −Σi ∆i Wj,i g′(in_j) ak = −ak ∆j
Back-propagation learning contd.

Training curve for 100 restaurant examples: finds exact fit

[Figure: total error on training set vs. number of epochs, falling from about 14 to 0.]
Learning curve for a multilayer network vs. a decision tree on the restaurant data:

[Figure: proportion correct on test set vs. training set size (0–100); both curves rise from about 0.4 toward 1.]
Handwritten digit recognition

3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error

Current best (kernel machines, vision algorithms) ≈ 0.6% error
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, fraud detection, etc.

Engineering, cognitive modelling, and neural system modelling
subfields have largely diverged