Learning from Observations
Unit 8
www.StudentRockStars.com
Outline
♦ Learning agents
♦ Inductive learning
♦ Decision tree learning
♦ Measuring learning performance
Learning
Learning is essential for unknown environments,
i.e., when designer lacks omniscience
Learning is useful as a system construction method,
i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent’s decision mechanisms to improve performance
Learning agents
[Figure: a general learning agent. The Critic, applying a fixed Performance standard, judges Sensor input and sends feedback to the Learning element; the Learning element makes changes to the Performance element and receives knowledge from it; learning goals drive a Problem generator that suggests exploratory experiments; the Performance element maps percepts to actions through the Effectors, acting on the Environment.]
Learning element
Design of learning element is dictated by
♦ what type of performance element is used
♦ which functional component is to be learned
♦ how that functional component is represented
♦ what kind of feedback is available

Example scenarios:

Performance element   Component          Representation             Feedback
Alpha-beta search     Eval. fn.          Weighted linear function   Win/loss
Logical agent         Transition model   Successor-state axioms     Outcome
Inductive learning

f is the target function

An example is a pair (x, f (x)), e.g., a tic-tac-toe board position x with value f (x) = +1

Problem: find a hypothesis h
such that h ≈ f
given a training set of examples
E.g., curve fitting:

[Figure sequence: the same data points for f(x) fitted by hypotheses of increasing complexity, from a straight line to a high-degree polynomial; later fits are consistent with more of the data but are also more complex hypotheses.]
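The trade-off the figure sequence illustrates can be sketched numerically (the five data points and the fitting methods below are invented for illustration): a degree-4 polynomial passes through all five points exactly, while a least-squares line does not, yet the line is the far simpler hypothesis.

```python
# Consistency vs. simplicity: exact interpolation vs. a least-squares line.
# Data points are invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0]

def lagrange(xs, ys, x):
    """Degree-(n-1) polynomial through all n points (exact interpolation)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def least_squares_line(xs, ys):
    """Closed-form straight-line fit; returns (slope, intercept)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
            / sum((x - xbar) ** 2 for x in xs)
    return slope, ybar - slope * xbar

slope, intercept = least_squares_line(xs, ys)
interp_error = sum((lagrange(xs, ys, x) - y) ** 2 for x, y in zip(xs, ys))
line_error = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
print(interp_error)  # 0.0: the complex hypothesis fits every training point
print(line_error)    # positive: the simple hypothesis misfits the training data
```

The interpolant is perfectly consistent with the data; whether it is the better hypothesis is exactly the question the slides return to.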
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won’t wait for a table:
Example  Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some  $$$    F     T    French   0–10   T
X2       T    F    F    T    Full  $      F     F    Thai     30–60  F
X3       F    T    F    F    Some  $      F     F    Burger   0–10   T
X4       T    F    T    T    Full  $      F     F    Thai     10–30  T
X5       T    F    T    F    Full  $$$    F     T    French   >60    F
X6       F    T    F    T    Some  $$     T     T    Italian  0–10   T
X7       F    T    F    F    None  $      T     F    Burger   0–10   F

(WillWait is the target; the full set has 12 examples, X1–X12)
Decision trees
One possible representation for hypotheses
E.g., here is the “true” tree for deciding whether to wait:
Patrons?
  None  → F
  Some  → T
  Full  → WaitEstimate?
            >60   → F
            30–60 → Alternate?
            10–30 → Hungry?
            0–10  → T

(below these, the tree tests Bar? and Raining?, with No/Yes branches ending in F/T leaves)
Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:
A  B  A xor B
F  F  F
F  T  T
T  F  T
T  T  F

[Tree: root tests A; each branch then tests B; the four leaves reproduce the truth-table entries F, T, T, F.]

Trivially, there is a consistent decision tree for any training set
w/ one path to leaf for each example (unless f nondeterministic in x)
Hypothesis spaces
How many distinct decision trees with n Boolean attributes??

= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)??

Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
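Both counts are easy to check directly:

```python
# Number of Boolean functions of n attributes = number of truth tables
# with 2**n rows, each row labeled T or F: 2**(2**n).
n = 6
print(2 ** (2 ** n))  # 18446744073709551616

# Purely conjunctive hypotheses: each attribute is in (positive),
# in (negative), or out, so 3**n of them.
print(3 ** n)  # 729
```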
Decision tree learning

Idea: (recursively) choose “most significant” attribute as root of (sub)tree

function DTL(examples, attributes, default) returns a decision tree
   if examples is empty then return default
   else if all examples have the same classification then return the classification
   else if attributes is empty then return Mode(examples)
   else
       best ← Choose-Attribute(attributes, examples)
       tree ← a new decision tree with root test best
       for each value vi of best do
           examplesi ← {elements of examples with best = vi}
           subtree ← DTL(examplesi, attributes − best, Mode(examples))
           add a branch to tree with label vi and subtree subtree
       return tree
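A runnable sketch of DTL in Python (the dictionary-based example format, the information-gain version of Choose-Attribute, and the XOR test data are illustrative choices, not part of the pseudocode):

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["label"] for e in examples).values()
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)

def choose_attribute(attributes, examples):
    # Information gain: pick the attribute with the smallest remainder.
    def remainder(a):
        values = {e[a] for e in examples}
        return sum(len(sub) / len(examples) * entropy(sub)
                   for v in values
                   for sub in [[e for e in examples if e[a] == v]])
    return min(attributes, key=remainder)

def mode(examples):
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:
        return default
    labels = {e["label"] for e in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    branches = {}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        branches[v] = dtl(subset, [a for a in attributes if a != best],
                          mode(examples))
    return (best, branches)  # internal node: (attribute, {value: subtree})

def classify(tree, example):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]
    return tree

# A consistent tree exists for any deterministic training set, e.g. XOR:
data = [{"A": a, "B": b, "label": a != b}
        for a in (False, True) for b in (False, True)]
tree = dtl(data, ["A", "B"], None)
print(all(classify(tree, e) == e["label"] for e in data))  # True
```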
Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”
[Figure: splitting the examples by Patrons? (None / Some / Full) gives nearly pure subsets, whereas splitting by Type? (French / Italian / Thai / Burger) leaves each subset half positive, half negative.]
Information
Information answers questions
The more clueless I am about the answer initially, the more information is
contained in the answer

Scale: 1 bit = answer to Boolean question with prior ⟨0.5, 0.5⟩

Information in an answer when prior is ⟨P1, . . . , Pn⟩ is

H(⟨P1, . . . , Pn⟩) = Σ_{i=1}^{n} −Pi log2 Pi

(also called entropy of the prior)
Information contd.
Suppose we have p positive and n negative examples at the root
⇒ H(⟨p/(p+n), n/(p+n)⟩) bits needed to classify a new example

E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit

An attribute splits the examples E into subsets Ei, each of which (we hope)
needs less information to complete the classification

Let Ei have pi positive and ni negative examples
⇒ H(⟨pi/(pi+ni), ni/(pi+ni)⟩) bits needed to classify a new example
⇒ expected number of bits per example over all branches is

Σi ((pi + ni)/(p + n)) H(⟨pi/(pi + ni), ni/(pi + ni)⟩)

For Patrons?, this is 0.459 bits, for Type? this is (still) 1 bit
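Both numbers can be reproduced directly. The per-branch positive/negative counts below assume the standard 12-example restaurant set (Patrons?: None 0+/2−, Some 4+/0−, Full 2+/4−; Type?: 1+/1−, 1+/1−, 2+/2−, 2+/2−), of which only X1–X7 appear in the table above.

```python
import math

def H(ps):
    """Entropy of a discrete prior, in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def remainder(splits, total):
    """Expected bits per example after a split: sum of (pi+ni)/(p+n) * H(...)."""
    return sum((p + n) / total * H([p / (p + n), n / (p + n)])
               for p, n in splits)

patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger
print(round(remainder(patrons, 12), 3))  # 0.459
print(round(remainder(type_, 12), 3))    # 1.0
```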
Example contd.
Decision tree learned from the 12 examples:
Patrons?
  None → F
  Some → T
  Full → Hungry?
           Yes → Type?
                   French → T
                   Italian → F
                   Thai → Fri/Sat?
                            No → F
                            Yes → T
                   Burger → T
           No → F
Substantially simpler than “true” tree—a more complex hypothesis isn’t justified by a small amount of data
Performance measurement
How do we know that h ≈ f ? (Hume’s Problem of Induction)
1) Use theorems of computational/statistical learning theory
2) Try h on a new test set of examples
(use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size

[Figure: learning curve on the restaurant data; % correct on the test set rises from about 0.4 toward 1 as training set size grows from 0 to 100.]

Chapter 18, Sections 1–3
Performance measurement contd.
Learning curve depends on
– realizable (can express target function) vs. non-realizable
  non-realizability can be due to missing attributes
  or restricted hypothesis class (e.g., thresholded linear function)
– redundant expressiveness (e.g., loads of irrelevant attributes)

[Figure: % correct vs. # of examples; the realizable curve approaches 1, the redundant curve rises more slowly, and the nonrealizable curve plateaus below 1.]
Summary
Learning needed for unknown environments, lazy designers
Learning agent = performance element + learning element

Learning method depends on type of performance element, available
feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis
that is approximately consistent with training examples

Decision tree learning using information gain
Statistical learning
Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
– ML parameter learning with complete data
– linear regression
Full Bayesian learning

H is the hypothesis variable, values h1, h2, . . ., prior P(H)

jth observation dj gives the outcome of random variable Dj
training data d = d1, . . . , dN

Given the data so far, each hypothesis has a posterior probability:

P (hi|d) = αP (d|hi)P (hi)

where P (d|hi) is called the likelihood
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

What kind of bag is it? What flavour will the next candy be?
Posterior probability of hypotheses

[Figure: P(hi | d) vs. number of samples in d (0–10, all limes): P(h3 | d) starts highest at 0.4 but falls; P(h5 | d) starts at 0.1 and rises toward 1; P(h1 | d) drops to 0 after the first lime.]
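The curves can be reproduced with a few lines of Bayesian updating (a sketch; it assumes, as in the figure, that every observed candy is lime):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi)

def posterior(num_limes):
    """P(hi | d) after observing num_limes lime candies in a row."""
    unnorm = [pr * pl ** num_limes for pr, pl in zip(priors, p_lime)]
    alpha = 1 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(num_limes):
    """P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)."""
    return sum(pl * po for pl, po in zip(p_lime, posterior(num_limes)))

print(round(predict_lime(0), 3))   # 0.5 with no data
print(round(posterior(10)[4], 2))  # 0.9: h5 dominates after 10 limes
print(round(predict_lime(10), 2))  # 0.97: prediction approaches 1
```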
Prediction probability
[Figure: P(next candy is lime | d) vs. number of samples in d; the prediction starts at 0.5 with no data and rises toward 1 as more lime candies are observed.]
MAP approximation
Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)
Maximum a posteriori (MAP) learning: choose hMAP maximizing P (hi|d)

I.e., maximize P (d|hi)P (hi) or log P (d|hi) + log P (hi)

Log terms can be viewed as (negative of)
bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, P (d|hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi)
I.e., simply get the best fit to the data; identical to MAP for uniform prior
(which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method
ML parameter learning in Bayes nets

[Bayes net: a single node Flavor, with parameter P (F = cherry) = θ]

Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so

P (d|hθ) = Π_{j=1}^{N} P (dj|hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ—which is easier for the log-likelihood:

L(d|hθ) = log P (d|hθ) = Σ_{j=1}^{N} log P (dj|hθ) = c log θ + ℓ log(1 − θ)

dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ) = c/N
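The closed-form answer θ = c/N can be checked numerically (the counts c = 7, ℓ = 3 below are invented for illustration):

```python
import math

def log_likelihood(theta, c, l):
    # L(d | h_theta) = c log(theta) + l log(1 - theta)
    return c * math.log(theta) + l * math.log(1 - theta)

c, l = 7, 3  # hypothetical counts: 7 cherries, 3 limes
# Grid search over theta in (0, 1); the log-likelihood is concave,
# so the grid maximum sits at the analytic optimum c/N = 0.7.
best = max((i / 1000 for i in range(1, 1000)),
           key=lambda t: log_likelihood(t, c, l))
print(best)         # 0.7
print(c / (c + l))  # closed form: 0.7
```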
Multiple parameters
Red/green wrapper depends probabilistically on flavor:

[Bayes net: Flavor, with P (F = cherry) = θ, is parent of Wrapper, with P (W = red | F = cherry) = θ1 and P (W = red | F = lime) = θ2]

Likelihood for, e.g., cherry candy in green wrapper:

P (F = cherry, W = green|hθ,θ1,θ2)
   = P (F = cherry|hθ,θ1,θ2)P (W = green|F = cherry, hθ,θ1,θ2)
   = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:

P (d|hθ,θ1,θ2) = θ^c(1 − θ)^ℓ · θ1^{rc}(1 − θ1)^{gc} · θ2^{rℓ}(1 − θ2)^{gℓ}
Multiple parameters contd.

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0   ⇒   θ = c/(c + ℓ)

∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0   ⇒   θ1 = rc/(rc + gc)

∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0   ⇒   θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately
Linear regression

[Figure: left, the density P (y|x) of a linear Gaussian model over x, y ∈ [0, 1]; right, the corresponding best-fit straight line through sample data.]

Maximizing P (y|x) = (1/(√(2π)σ)) e^{−(y−(θ1x+θ2))²/(2σ²)} w.r.t. θ1, θ2

= minimizing E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution
for a linear fit assuming Gaussian noise of fixed variance
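A minimal check of this equivalence via the closed-form least-squares fit (the data points are invented; they are roughly y = x plus noise):

```python
# ML fit under Gaussian noise = ordinary least squares (closed form).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.9, 4.1]  # roughly y = x plus noise

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
theta1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)  # slope
theta2 = ybar - theta1 * xbar                # intercept

sse = sum((y - (theta1 * x + theta2)) ** 2 for x, y in zip(xs, ys))
# Any other line has a larger sum of squared errors, e.g. slope 1.1:
assert sse <= sum((y - 1.1 * x) ** 2 for x, y in zip(xs, ys))
print(round(theta1, 3), round(theta2, 3))  # 1.0 0.04
```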
Summary
Full Bayesian learning gives best possible predictions but is intractable
MAP learning balances complexity with accuracy on training data
Maximum likelihood assumes uniform prior, OK for large data sets

1. Choose a parameterized family of models to describe the data
   requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
   may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
Neural networks
Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time
Signals are noisy “spike trains” of electrical potential
[Figure: a neuron; cell body with Nucleus, Dendrites, an Axon with axonal arborization, and Synapses connecting to axons from other cells.]
McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
ai ← g(ini) = g(Σj Wj,i aj)

[Figure: unit i; input links carry activations aj weighted by Wj,i, plus a fixed bias input a0 = −1 with bias weight W0,i; the input function Σ computes ini, the activation function g produces the output ai = g(ini), which is sent on the output links.]
Activation functions
[Figure: (a) a step function g(ini), jumping from 0 to +1 at a threshold; (b) a sigmoid g(ini) = 1/(1 + e^{−ini}), rising smoothly from 0 to +1.]
Implementing logical functions

[Figure: units computing AND (W1 = 1, W2 = 1), OR (W1 = 1, W2 = 1), and NOT (W1 = −1), with suitable bias weights.]

McCulloch and Pitts: every Boolean function can be implemented
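With a step activation and the bias convention a0 = −1 from the unit diagram, suitable bias weights complete the three gates. The slide shows only W1 and W2; the bias weights 1.5, 0.5 and −0.5 below are my own (standard) choices:

```python
def step_unit(weights, inputs):
    # Unit output: g(in) with g = step, in = sum_j W_j a_j, where a_0 = -1 (bias).
    total = sum(w * a for w, a in zip(weights, [-1] + inputs))
    return 1 if total > 0 else 0

# weights = [W0 (bias), W1, W2]; W0 values here are illustrative choices.
AND = lambda x1, x2: step_unit([1.5, 1, 1], [x1, x2])
OR  = lambda x1, x2: step_unit([0.5, 1, 1], [x1, x2])
NOT = lambda x1:     step_unit([-0.5, -1], [x1])

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                     # [1, 0]
```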
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons

Feed-forward networks implement functions, have no internal state

Recurrent networks:
– Hopfield networks have symmetric weights (Wi,j = Wj,i)
  g(x) = sign(x), ai = ±1; holographic associative memory
– Boltzmann machines use stochastic activation functions,
  ≈ MCMC in Bayes nets
Feed-forward example
[Figure: input units 1 and 2 feed hidden units 3 and 4 via weights W1,3, W1,4, W2,3, W2,4; units 3 and 4 feed output unit 5 via W3,5 and W4,5.]

Feed-forward network = a parameterized family of nonlinear functions:
a5 = g(W3,5 · a3 + W4,5 · a4) = g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))
Single-layer perceptrons
[Figure: input units connected directly to output units by weights Wj,i; the output of a sigmoid perceptron is a soft threshold surface over the input space (x1, x2).]
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:

Σj Wj xj > 0   or   W · x > 0

[Figure: in (x1, x2) space, AND and OR are linearly separable; the positive and negative examples of XOR cannot be separated by any single line.]
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is

E = (1/2) Err² ≡ (1/2)(y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σ_{j=0}^{n} Wj xj))
       = −Err × g′(in) × xj

Simple weight update rule:

Wj ← Wj + α × Err × g′(in) × xj
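Applied to a sigmoid unit, this rule learns any linearly separable function. A sketch training it on OR (the learning rate, epoch count, and random seed are illustrative choices):

```python
import math, random

def g(x):
    return 1 / (1 + math.exp(-x))  # sigmoid; g'(in) = g(in)(1 - g(in))

# OR examples; first input is the fixed bias a0 = -1.
examples = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]

random.seed(0)
w = [random.uniform(-0.5, 0.5) for _ in range(3)]
alpha = 0.5
for epoch in range(5000):
    for x, y in examples:
        a = g(sum(wj * xj for wj, xj in zip(w, x)))
        err = y - a
        # Wj <- Wj + alpha * Err * g'(in) * xj
        w = [wj + alpha * err * a * (1 - a) * xj for wj, xj in zip(w, x)]

preds = [g(sum(wj * xj for wj, xj in zip(w, x))) for x, _ in examples]
print([round(p) for p in preds])  # [0, 1, 1, 1]
```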
Perceptron learning contd.

[Figure: proportion correct on test set vs. training set size (0–100). Left, MAJORITY on 11 inputs: the perceptron learns faster than the decision tree. Right, RESTAURANT data: the decision tree outperforms the perceptron.]
Multilayer perceptrons
Layers are usually fully connected;
numbers of hidden units typically chosen by hand
[Figure: a feed-forward network with input units ak, a layer of hidden units aj connected to them by weights Wk,j, and output units ai connected to the hidden layer by weights Wj,i.]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: surface plots of hW(x1, x2); combining two opposite-facing soft thresholds makes a ridge, and combining two ridges makes a bump.]
Back-propagation learning
Output layer: same as for single-layer perceptron,
Wj,i ← Wj,i + α × aj × ∆i

where ∆i = Err_i × g′(in_i)

Hidden layer: back-propagate the error from the output layer:

∆j = g′(in_j) Σi Wj,i ∆i

Update rule for weights in hidden layer:

Wk,j ← Wk,j + α × ak × ∆j
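The ∆ rules can be checked against numerical differentiation on a tiny 2–2–1 network (the weights, input, and target below are arbitrary illustrative values):

```python
import math

def g(x):
    return 1 / (1 + math.exp(-x))

x, y = [0.3, -0.2], 0.9         # one training example (illustrative)
W1 = [[0.1, 0.4], [-0.3, 0.2]]  # W1[j][k]: input k -> hidden j
W2 = [0.5, -0.1]                # W2[j]: hidden j -> output

def error(W1, W2):
    a_h = [g(sum(W1[j][k] * x[k] for k in range(2))) for j in range(2)]
    a_o = g(sum(W2[j] * a_h[j] for j in range(2)))
    return 0.5 * (y - a_o) ** 2

# Back-propagation, exactly as in the derivation (one output node):
in_h = [sum(W1[j][k] * x[k] for k in range(2)) for j in range(2)]
a_h = [g(v) for v in in_h]
a_o = g(sum(W2[j] * a_h[j] for j in range(2)))
delta_o = (y - a_o) * a_o * (1 - a_o)              # Delta_i = Err g'(in_i)
delta_h = [a_h[j] * (1 - a_h[j]) * W2[j] * delta_o  # Delta_j
           for j in range(2)]

eps = 1e-6
for j in range(2):
    # dE/dW2[j] = -a_h[j] * Delta_i; compare with a central difference.
    W2p, W2m = W2[:], W2[:]
    W2p[j] += eps; W2m[j] -= eps
    numeric = (error(W1, W2p) - error(W1, W2m)) / (2 * eps)
    print(abs(numeric - (-a_h[j] * delta_o)) < 1e-8)  # True

for j in range(2):
    for k in range(2):
        # dE/dW1[j][k] = -x[k] * Delta_j
        W1p = [row[:] for row in W1]; W1p[j][k] += eps
        W1m = [row[:] for row in W1]; W1m[j][k] -= eps
        numeric = (error(W1p, W2) - error(W1m, W2)) / (2 * eps)
        print(abs(numeric - (-x[k] * delta_h[j])) < 1e-8)  # True
```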
Back-propagation derivation
The squared error on a single example is defined as

E = (1/2) Σi (yi − ai)²

where the sum is over the nodes in the output layer.

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i = −(yi − ai) ∂g(in_i)/∂Wj,i
         = −(yi − ai) g′(in_i) ∂in_i/∂Wj,i = −(yi − ai) g′(in_i) ∂/∂Wj,i (Σj Wj,i aj)
         = −(yi − ai) g′(in_i) aj = −aj ∆i
Back-propagation derivation contd.

∂E/∂Wk,j = −Σi (yi − ai) ∂ai/∂Wk,j = −Σi (yi − ai) ∂g(in_i)/∂Wk,j
         = −Σi (yi − ai) g′(in_i) ∂in_i/∂Wk,j = −Σi ∆i ∂/∂Wk,j (Σj Wj,i aj)
         = −Σi ∆i Wj,i ∂aj/∂Wk,j = −Σi ∆i Wj,i ∂g(in_j)/∂Wk,j
         = −Σi ∆i Wj,i g′(in_j) ∂in_j/∂Wk,j
         = −Σi ∆i Wj,i g′(in_j) ∂/∂Wk,j (Σk Wk,j ak)
         = −Σi ∆i Wj,i g′(in_j) ak = −ak ∆j
Back-propagation learning contd.

Training curve for 100 restaurant examples: finds exact fit

[Figure: total error on training set vs. number of epochs, falling from about 14 to 0.]
Learning curve for a multilayer network vs. a decision tree on the restaurant data:

[Figure: proportion correct on test set vs. training set size (0–100); both curves rise from about 0.4 toward 1.]
Handwritten digit recognition

3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error

Current best (kernel machines, vision algorithms) ≈ 0.6% error
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, fraud detection, etc.

Engineering, cognitive modelling, and neural system modelling
subfields have largely diverged