Learning from Observations

Unit 8


Outline
♦ Learning agents
♦ Inductive learning

♦ Decision tree learning

♦ Measuring learning performance
Chapter 18, Sections 1–3



Learning
Learning is essential for unknown environments,
i.e., when designer lacks omniscience

Learning is useful as a system construction method,
i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent’s decision mechanisms to improve performance


Learning agents
[Figure: learning agent architecture — the Critic, judging percepts from the Sensors against a Performance standard, sends feedback to the Learning element; the Learning element makes changes to the Performance element, draws knowledge from it, and sets learning goals for the Problem generator, which proposes exploratory experiments; the Performance element drives the Effectors acting on the Environment]


Learning element
Design of learning element is dictated by
♦ what type of performance element is used

♦ which functional component is to be learned

♦ how that functional component is represented

♦ what kind of feedback is available
Example scenarios:

Performance element   Component           Representation             Feedback
Alpha–beta search     Eval. fn.           Weighted linear function   Win/loss
Logical agent         Transition model    Successor-state axioms     Outcome
Utility-based agent   Transition model    Dynamic Bayes net          Outcome
Simple reflex agent   Percept–action fn   Neural net                 Correct action

Supervised learning: correct answers for each instance


Reinforcement learning: occasional rewards


Inductive learning (a.k.a. Science)

Simplest form: learn a function from examples (tabula rasa)

f is the target function

An example is a pair (x, f(x)), e.g., a tic-tac-toe position x labelled with f(x) = +1

Problem: find a hypothesis h
such that h ≈ f
given a training set of examples
.S

(This is a highly simplified model of real learning:


– Ignores prior knowledge
– Assumes a deterministic, observable “environment”


– Assumes examples are given
– Assumes that the agent wants to learn f —why?)


Inductive learning method


Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure: training data points in the (x, f(x)) plane; successive slides fit curves of increasing complexity — a straight line, piecewise-linear curves, a high-degree polynomial — all passing through or near the data]

Ockham’s razor: maximize a combination of consistency and simplicity
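To see the razor in action numerically (a sketch — the training data and the probe point x = 6 are invented for this example), compare the maximally complex consistent hypothesis with the simple near-consistent line h(x) = x:

```python
def lagrange(points, x):
    """Evaluate the unique interpolating polynomial through `points` at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# data generated from f(x) = x, with one slightly noisy observation at x = 2
train = [(0, 0.0), (1, 1.0), (2, 2.1), (3, 3.0), (4, 4.0)]

# the degree-4 interpolant is perfectly consistent with the training set...
assert all(abs(lagrange(train, x) - y) < 1e-9 for x, y in train)

# ...but at x = 6 it predicts 10.5, while the simple line h(x) = x predicts 6.0
print(lagrange(train, 6.0))  # 10.5 (approximately)
```

The 0.1 of noise costs the simple hypothesis a tiny training error but costs the perfectly consistent one a 4.5 error off the training range — exactly the tradeoff Ockham’s razor arbitrates.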



Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won’t wait for a table:

Example   Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est     WillWait
X1        T    F    F    T    Some   $$$    F     T    French   0–10    T
X2        T    F    F    T    Full   $      F     F    Thai     30–60   F
X3        F    T    F    F    Some   $      F     F    Burger   0–10    T
X4        T    F    T    T    Full   $      F     F    Thai     10–30   T
X5        T    F    T    F    Full   $$$    F     T    French   >60     F
X6        F    T    F    T    Some   $$     T     T    Italian  0–10    T
X7        F    T    F    F    None   $      T     F    Burger   0–10    F
X8        F    F    F    T    Some   $$     T     T    Thai     0–10    T
X9        F    T    T    F    Full   $      T     F    Burger   >60     F
X10       T    T    T    T    Full   $$$    F     T    Italian  10–30   F
X11       F    F    F    F    None   $      F     F    Thai     0–10    F
X12       T    T    T    T    Full   $      F     F    Burger   30–60   T

Classification of examples is positive (T) or negative (F)



Decision trees
One possible representation for hypotheses
E.g., here is the “true” tree for deciding whether to wait:

Patrons?
  None → F
  Some → T
  Full → WaitEstimate?
    >60 → F
    30–60 → Alternate?
      No → Reservation?
        No → Bar? (No → F, Yes → T)
        Yes → T
      Yes → Fri/Sat? (No → F, Yes → T)
    10–30 → Hungry?
      No → T
      Yes → Alternate?
        No → T
        Yes → Raining? (No → F, Yes → T)
    0–10 → T
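One concrete way to hold such a hypothesis (a sketch, not the book’s code — the nested-dict encoding and attribute-value strings are my own choices): each internal node is a dict mapping an attribute name to its branches, and leaves are the strings "T"/"F".

```python
# the "true" restaurant tree encoded as nested dicts
tree = {"Patrons?": {
    "None": "F",
    "Some": "T",
    "Full": {"WaitEstimate?": {
        ">60": "F",
        "30-60": {"Alternate?": {
            "No": {"Reservation?": {"No": {"Bar?": {"No": "F", "Yes": "T"}},
                                    "Yes": "T"}},
            "Yes": {"Fri/Sat?": {"No": "F", "Yes": "T"}}}},
        "10-30": {"Hungry?": {
            "No": "T",
            "Yes": {"Alternate?": {"No": "T",
                                   "Yes": {"Raining?": {"No": "F", "Yes": "T"}}}}}},
        "0-10": "T"}}}}

def classify(node, example):
    # walk down the tree, testing one attribute per level, until a leaf
    while isinstance(node, dict):
        attr, branches = next(iter(node.items()))
        node = branches[example[attr]]
    return node

print(classify(tree, {"Patrons?": "Full", "WaitEstimate?": "0-10"}))  # T
```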


Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:

A   B   A xor B
F   F   F
F   T   T
T   F   T
T   T   F

A?
  F → B? (F → F, T → T)
  T → B? (F → T, T → F)
de
Trivially, there is a consistent decision tree for any training set
w/ one path to leaf for each example (unless f nondeterministic in x)
but it probably won’t generalize to new examples

Prefer to find more compact decision trees


Hypothesis spaces
How many distinct decision trees with n Boolean attributes??

= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)??

Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses

More expressive hypothesis space
– increases chance that target function can be expressed
– increases number of hypotheses consistent w/ training set
⇒ may get worse predictions
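Both counts are quick to verify (Python integers are arbitrary precision, so 2^(2^6) poses no problem):

```python
n = 6
num_trees = 2 ** (2 ** n)   # distinct Boolean functions of n attributes
num_conj = 3 ** n           # each attribute: positive, negative, or absent
print(num_trees)  # 18446744073709551616
print(num_conj)   # 729
```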


Decision tree learning


Aim: find a small tree consistent with the training examples

Idea: (recursively) choose “most significant” attribute as root of (sub)tree
function DTL(examples, attributes, default) returns a decision tree
   if examples is empty then return default
   else if all examples have the same classification then return the classification
   else if attributes is empty then return Mode(examples)
   else
      best ← Choose-Attribute(attributes, examples)
      tree ← a new decision tree with root test best
      for each value vi of best do
         examplesi ← {elements of examples with best = vi}
         subtree ← DTL(examplesi, attributes − best, Mode(examples))
         add a branch to tree with label vi and subtree subtree
      return tree
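A minimal executable version of the algorithm (a sketch: the "label" field name and the tie-breaking in Mode are my own choices, and Choose-Attribute is stubbed to take the first attribute rather than the information-gain choice developed on the following slides):

```python
from collections import Counter

def mode(examples):
    # majority classification; Counter.most_common breaks ties arbitrarily
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default):
    if not examples:
        return default
    classes = {e["label"] for e in examples}
    if len(classes) == 1:
        return classes.pop()
    if not attributes:
        return mode(examples)
    best = attributes[0]                      # stub for Choose-Attribute
    tree = {best: {}}
    for v in sorted({e[best] for e in examples}):
        exs = [e for e in examples if e[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = dtl(exs, rest, mode(examples))
    return tree

examples = [{"Pat": "Some", "label": "T"},
            {"Pat": "None", "label": "F"},
            {"Pat": "Some", "label": "T"}]
print(dtl(examples, ["Pat"], "F"))  # {'Pat': {'None': 'F', 'Some': 'T'}}
```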


Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”

[Figure: splitting the 12 examples on Patrons? yields subsets None (all negative), Some (all positive), Full (mixed); splitting on Type? (French, Italian, Thai, Burger) leaves every subset an even mix of positives and negatives]

Patrons? is a better choice—gives information about the classification



Information
Information answers questions
The more clueless I am about the answer initially, the more information is

contained in the answer

Scale: 1 bit = answer to Boolean question with prior ⟨0.5, 0.5⟩

Information in an answer when prior is ⟨P1, …, Pn⟩ is

    H(⟨P1, …, Pn⟩) = Σi −Pi log2 Pi

(also called entropy of the prior)
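In code (a sketch; the function name is mine, and the p > 0 guard applies the usual 0 log 0 = 0 convention):

```python
from math import log2

def entropy(priors):
    # H(<P1,...,Pn>) = sum over i of -Pi log2 Pi, with 0 log 0 taken as 0
    return -sum(p * log2(p) for p in priors if p > 0)

print(entropy([0.5, 0.5]))               # 1.0 — one bit for a fair Boolean question
print(entropy([0.25, 0.25, 0.25, 0.25])) # 2.0 — two bits for four equally likely answers
```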

Information contd.
Suppose we have p positive and n negative examples at the root
⇒ H(⟨p/(p+n), n/(p+n)⟩) bits needed to classify a new example

E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit

An attribute splits the examples E into subsets Ei, each of which (we hope)
needs less information to complete the classification

Let Ei have pi positive and ni negative examples
⇒ H(⟨pi/(pi+ni), ni/(pi+ni)⟩) bits needed to classify a new example
⇒ expected number of bits per example over all branches is

    Σi (pi + ni)/(p + n) × H(⟨pi/(pi+ni), ni/(pi+ni)⟩)

For Patrons?, this is 0.459 bits, for Type this is (still) 1 bit
⇒ choose the attribute that minimizes the remaining information needed
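The two numbers can be recomputed directly from the counts in the example table (a sketch; the function names are mine):

```python
from math import log2

def H(p, n):
    # entropy of a subset with p positive and n negative examples
    return -sum(x / (p + n) * log2(x / (p + n)) for x in (p, n) if x > 0)

def remainder(splits):
    # splits: (pi, ni) counts per branch; expected bits left after the split
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    return sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in splits)

# counts read off the 12 restaurant examples
patrons = [(0, 2), (4, 0), (2, 4)]         # None, Some, Full
type_ = [(1, 1), (1, 1), (2, 2), (2, 2)]   # French, Italian, Thai, Burger
print(round(remainder(patrons), 3))  # 0.459
print(round(remainder(type_), 3))    # 1.0
```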


Example contd.
Decision tree learned from the 12 examples:

Patrons?
  None → F
  Some → T
  Full → Hungry?
    No → F
    Yes → Type?
      French → T
      Italian → F
      Thai → Fri/Sat? (No → F, Yes → T)
      Burger → T
w

Substantially simpler than the “true” tree—a more complex hypothesis isn’t justified by the small amount of data


Performance measurement
How do we know that h ≈ f ? (Hume’s Problem of Induction)
1) Use theorems of computational/statistical learning theory

2) Try h on a new test set of examples
(use same distribution over example space as training set)
Learning curve = % correct on test set as a function of training set size

[Figure: learning curve — % correct on test set climbs from about 0.4 toward 1 as training set size grows from 0 to 100]

Performance measurement contd.


Learning curve depends on
– realizable (can express target function) vs. non-realizable

non-realizability can be due to missing attributes
or restricted hypothesis class (e.g., thresholded linear function)
– redundant expressiveness (e.g., loads of irrelevant attributes)
[Figure: % correct vs. number of examples — a realizable hypothesis class climbs to 1, a redundant one climbs more slowly, and a nonrealizable one plateaus below 1]


Summary
Learning needed for unknown environments, lazy designers

Learning agent = performance element + learning element

Learning method depends on type of performance element, available
feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis
that is approximately consistent with training examples

Decision tree learning using information gain

Learning performance = prediction accuracy measured on test set



Statistical learning


Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning

♦ Bayes net learning
– ML parameter learning with complete data
– linear regression
Chapter 20, Sections 1–3


Full Bayesian learning


View learning as Bayesian updating of a probability distribution
over the hypothesis space

H is the hypothesis variable, values h1, h2, . . ., prior P(H)

jth observation dj gives the outcome of random variable Dj
training data d = d1, . . . , dN

Given the data so far, each hypothesis has a posterior probability:

    P(hi|d) = α P(d|hi) P(hi)

where P(d|hi) is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:

    P(X|d) = Σi P(X|d, hi) P(hi|d) = Σi P(X|hi) P(hi|d)


No need to pick one best-guess hypothesis!


Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies

40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag:



What kind of bag is it? What flavour will the next candy be?
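The posterior and prediction curves on the next two slides can be reproduced in a few lines (a sketch; d is taken to be a run of all-lime observations, and the function names are mine):

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi)

def posterior(n):
    # P(hi | d) ∝ P(d | hi) P(hi), for d = n lime candies in a row
    unnorm = [pr * (pl ** n) for pr, pl in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def p_next_lime(n):
    # Bayesian prediction: P(lime | d) = sum over i of P(lime | hi) P(hi | d)
    return sum(pl * ph for pl, ph in zip(p_lime, posterior(n)))

print(round(p_next_lime(0), 3))    # 0.5  before any observation
print(round(p_next_lime(1), 3))    # 0.65 after one lime
print(round(posterior(10)[4], 2))  # h5 dominates after ten limes
```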

Posterior probability of hypotheses

[Figure: posterior probabilities P(h1 | d) … P(h5 | d) as the number of samples in d grows from 0 to 10 — P(h5 | d) rises toward 1 while the others fall to 0]


Prediction probability

[Figure: P(next candy is lime | d) rises from 0.5 toward 1 as the number of samples in d grows from 0 to 10]


MAP approximation
Summing over the hypothesis space is often intractable
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi|d)

I.e., maximize P(d|hi)P(hi) or log P(d|hi) + log P(hi)

Log terms can be viewed as (negative of)
bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, P(d|hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)

ML approximation
For large data sets, prior becomes irrelevant
Maximum likelihood (ML) learning: choose hML maximizing P (d|hi)

I.e., simply get the best fit to the data; identical to MAP for uniform prior
(which is reasonable if all hypotheses are of the same complexity)
ML is the “standard” (non-Bayesian) statistical learning method


ML parameter learning in Bayes nets


Bag from a new manufacturer; fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models
[Bayes net: a single node Flavor, with P(F = cherry) = θ]

Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so

    P(d|hθ) = Π(j=1..N) P(dj|hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ—which is easier for the log-likelihood:

    L(d|hθ) = log P(d|hθ) = Σ(j=1..N) log P(dj|hθ) = c log θ + ℓ log(1 − θ)

    dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
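A numeric check of the closed form (the counts are invented for the example): a grid search over the log-likelihood finds its maximum at exactly c/N.

```python
from math import log

def log_likelihood(theta, c, l):
    # L(d | h_theta) = c log(theta) + l log(1 - theta)
    return c * log(theta) + l * log(1 - theta)

c, l = 7, 3        # say, 7 cherries and 3 limes unwrapped
mle = c / (c + l)  # closed-form ML estimate: 0.7
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, l))
print(mle, best)  # 0.7 0.7
```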


Multiple parameters
Red/green wrapper depends probabilistically on flavor:
[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ,
 P(W = red | F = cherry) = θ1, P(W = red | F = lime) = θ2]

Likelihood for, e.g., cherry candy in green wrapper:

    P(F = cherry, W = green | hθ,θ1,θ2)
      = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
      = θ · (1 − θ1)

N candies, rc red-wrapped cherry candies, etc.:

    P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

    L = [c log θ + ℓ log(1 − θ)]
      + [rc log θ1 + gc log(1 − θ1)]
      + [rℓ log θ2 + gℓ log(1 − θ2)]


Multiple parameters contd.


Derivatives of L contain only the relevant parameter:

    ∂L/∂θ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ)

    ∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)

    ∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately



Example: linear Gaussian model


[Figure: P(y|x) as a Gaussian ridge centered on the line y = θ1x + θ2, and the corresponding ML straight-line fit to sampled (x, y) points]

Maximizing P(y|x) = (1/(√(2π)σ)) e^(−(y − (θ1x + θ2))²/(2σ²)) w.r.t. θ1, θ2

    = minimizing E = Σ(j=1..N) (yj − (θ1xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution
for a linear fit assuming Gaussian noise of fixed variance
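For a concrete instance (the data below are invented and noise-free, so the fit is exact), the minimizing θ1, θ2 follow from the standard normal equations for a straight line:

```python
# least-squares fit of y ≈ t1*x + t2 via the normal equations
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
t1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
t2 = (sy - t1 * sx) / n                         # intercept
print(t1, t2)  # 2.0 1.0
```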


Summary
Full Bayesian learning gives best possible predictions but is intractable
MAP learning balances complexity with accuracy on training data

Maximum likelihood assumes uniform prior, OK for large data sets

1. Choose a parameterized family of models to describe the data
   requires substantial insight and sometimes new models

2. Write down the likelihood of the data as a function of the parameters
   may require summing over hidden variables, i.e., inference

3. Write down the derivative of the log likelihood w.r.t. each parameter

4. Find the parameter values such that the derivatives are zero
   may be hard/impossible; modern optimization techniques help


Neural networks

Chapter 20, Section 5



Outline
♦ Brains
♦ Neural networks

♦ Perceptrons

♦ Multilayer perceptrons

♦ Applications of neural networks


Brains
10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time
Signals are noisy “spike trains” of electrical potential

[Figure: schematic neuron — cell body (soma) containing the nucleus, branching dendrites, a long axon ending in an axonal arborization, and synapses connecting axons from other cells]


McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:

    ai ← g(ini) = g(Σj Wj,i aj)

[Figure: unit i — input links deliver activations aj, each weighted by Wj,i and summed (together with bias weight W0,i on a fixed input a0 = −1) into ini, which the activation function g squashes into the output ai sent along the output links]



A gross oversimplification of real neurons, but its purpose is
to develop understanding of what networks of simple units can do

www.StudentRockStars.com

Activation functions

[Figure: two activation functions g(ini) — (a) a step from 0 to +1 at ini = 0; (b) a sigmoid rising smoothly from 0 to +1]
.S

(a) is a step function or threshold function


(b) is a sigmoid function 1/(1 + e^(−x))



Changing the bias weight W0,i moves the threshold location


Implementing logical functions


AND: W0 = 1.5,  W1 = 1,  W2 = 1
OR:  W0 = 0.5,  W1 = 1,  W2 = 1
NOT: W0 = −0.5, W1 = −1

oc
McCulloch and Pitts: every Boolean function can be implemented
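The three units can be checked directly (a sketch; the step activation and the fixed a0 = −1 bias input follow the unit defined two slides back):

```python
def unit(weights, inputs):
    # weights[0] is the bias weight W0 on the fixed input a0 = -1
    in_i = -weights[0] + sum(w * a for w, a in zip(weights[1:], inputs))
    return 1 if in_i >= 0 else 0   # step (threshold) activation

AND = [1.5, 1, 1]
OR  = [0.5, 1, 1]
NOT = [-0.5, -1]

print([unit(AND, [a, b]) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([unit(OR,  [a, b]) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 1]
print([unit(NOT, [a]) for a in (0, 1)])                     # [1, 0]
```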


Network structures
Feed-forward networks:
– single-layer perceptrons

– multi-layer perceptrons

Feed-forward networks implement functions, have no internal state

Recurrent networks:
– Hopfield networks have symmetric weights (Wi,j = Wj,i)
  g(x) = sign(x), ai = ±1; holographic associative memory
– Boltzmann machines use stochastic activation functions,
  ≈ MCMC in Bayes nets
– recurrent neural nets have directed cycles with delays
  ⇒ have internal state (like flip-flops), can oscillate etc.



Feed-forward example

[Figure: feed-forward network — input units 1 and 2 connect to hidden units 3 and 4 via weights W1,3, W1,4, W2,3, W2,4; hidden units 3 and 4 connect to output unit 5 via W3,5, W4,5]

Feed-forward network = a parameterized family of nonlinear functions:



a5 = g(W3,5 · a3 + W4,5 · a4)
   = g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))
Adjusting weights changes the function: do learning this way!
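The composed function can be written out directly (a sketch; the weight values below are arbitrary illustrative choices, and g is taken to be the sigmoid):

```python
from math import exp

def g(x):
    return 1 / (1 + exp(-x))

def a5(a1, a2, W):
    # W maps (source, destination) unit pairs to weights
    a3 = g(W[1, 3] * a1 + W[2, 3] * a2)
    a4 = g(W[1, 4] * a1 + W[2, 4] * a2)
    return g(W[3, 5] * a3 + W[4, 5] * a4)

W = {(1, 3): 0.5, (2, 3): 0.25, (1, 4): -0.5, (2, 4): 0.75,
     (3, 5): 1.0, (4, 5): -1.0}
print(a5(1.0, 0.0, W))  # a nonlinear function of the inputs and weights
```

Changing any entry of W changes the function computed — which is exactly the handle that learning grabs.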


Single-layer perceptrons

[Figure: a single layer of input units wired directly to output units via weights Wj,i; the output of a sigmoid perceptron forms a soft threshold surface over the input space (x1, x2)]

Output units all operate separately—no shared weights



Adjusting weights moves the location, orientation, and steepness of cliff


Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc., but not XOR

Represents a linear separator in input space:

    Σj Wj xj > 0  or  W · x > 0
[Figure: (a) x1 and x2 and (b) x1 or x2 are linearly separable — a line splits the positive from the negative points; (c) x1 xor x2 is not]

Minsky & Papert (1969) pricked the neural network balloon


Perceptron learning
Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

    E = ½ Err² ≡ ½ (y − hW(x))²

Perform optimization search by gradient descent:

    ∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σ(j=0..n) Wj xj))
           = −Err × g′(in) × xj

Simple weight update rule:

    Wj ← Wj + α × Err × g′(in) × xj

E.g., +ve error ⇒ increase network output
⇒ increase weights on +ve inputs, decrease on −ve inputs
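Applied to a small linearly separable problem (a sketch learning OR; the bias encoding, learning rate, and epoch count are choices made for the example):

```python
from math import exp

def g(x):
    return 1 / (1 + exp(-x))

# examples for OR; x[0] = -1 is the fixed bias input
data = [((-1, 0, 0), 0), ((-1, 0, 1), 1), ((-1, 1, 0), 1), ((-1, 1, 1), 1)]
W = [0.0, 0.0, 0.0]
alpha = 0.5
for _ in range(2000):
    for x, y in data:
        in_ = sum(w * xi for w, xi in zip(W, x))
        err = y - g(in_)
        gp = g(in_) * (1 - g(in_))               # g'(in) for the sigmoid
        W = [w + alpha * err * gp * xi for w, xi in zip(W, x)]

outputs = [round(g(sum(w * xi for w, xi in zip(W, x)))) for x, _ in data]
print(outputs)  # [0, 1, 1, 1]
```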


Perceptron learning contd.


Perceptron learning rule converges to a consistent function
for any linearly separable data set

[Figure: proportion correct on test set vs. training set size — the perceptron outperforms the decision tree on MAJORITY-on-11-inputs data, while the decision tree outperforms the perceptron on RESTAURANT data]

Perceptron learns majority function easily, DTL is hopeless


DTL learns restaurant function easily, perceptron cannot represent it


Multilayer perceptrons
Layers are usually fully connected;
numbers of hidden units typically chosen by hand

[Figure: layered network — input units ak feed hidden units aj via weights Wk,j; hidden units feed output units ai via weights Wj,i]


Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers

[Figure: output surfaces hW(x1, x2) — two opposite-facing soft thresholds combine into a ridge; two perpendicular ridges combine into a bump]

Combine two opposite-facing threshold functions to make a ridge



Combine two perpendicular ridges to make a bump


Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units (cf DTL proof)

Back-propagation learning
Output layer: same as for single-layer perceptron,

    Wj,i ← Wj,i + α × aj × ∆i

where ∆i = Erri × g′(ini)

Hidden layer: back-propagate the error from the output layer:

    ∆j = g′(inj) Σi Wj,i ∆i

Update rule for weights in hidden layer:

    Wk,j ← Wk,j + α × ak × ∆j

(Most neuroscientists deny that back-propagation occurs in the brain)
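A single back-propagation update for a tiny 2–2–1 sigmoid network, checking that the output moves toward the target (a sketch; the weights, input, target, and α are illustrative values of my own):

```python
from math import exp

def g(x):
    return 1 / (1 + exp(-x))

def gp(x):                       # g'(x) for the sigmoid
    return g(x) * (1 - g(x))

def forward(Wkj, Wji, x):
    in_j = [sum(w * xk for w, xk in zip(row, x)) for row in Wkj]
    a_j = [g(v) for v in in_j]
    in_i = sum(w * a for w, a in zip(Wji, a_j))
    return in_j, a_j, in_i, g(in_i)

def backprop_step(Wkj, Wji, x, y, alpha=0.5):
    in_j, a_j, in_i, a_i = forward(Wkj, Wji, x)
    delta_i = (y - a_i) * gp(in_i)                         # output-layer delta
    delta_j = [gp(in_j[j]) * Wji[j] * delta_i for j in range(len(a_j))]
    new_Wji = [w + alpha * a_j[j] * delta_i for j, w in enumerate(Wji)]
    new_Wkj = [[w + alpha * x[k] * delta_j[j] for k, w in enumerate(row)]
               for j, row in enumerate(Wkj)]
    return new_Wkj, new_Wji

Wkj = [[0.1, -0.2], [0.3, 0.4]]   # hidden-layer weights Wk,j
Wji = [0.5, -0.6]                 # output-layer weights Wj,i
x, y = (1.0, 0.0), 1.0
before = forward(Wkj, Wji, x)[3]
Wkj, Wji = backprop_step(Wkj, Wji, x, y)
after = forward(Wkj, Wji, x)[3]
print(before, after)  # the output moves toward the target y = 1
```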



Back-propagation derivation
The squared error on a single example is defined as
    E = ½ Σi (yi − ai)²

where the sum is over the nodes in the output layer.

    ∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i = −(yi − ai) ∂g(ini)/∂Wj,i
             = −(yi − ai) g′(ini) ∂ini/∂Wj,i = −(yi − ai) g′(ini) ∂/∂Wj,i (Σj Wj,i aj)
             = −(yi − ai) g′(ini) aj = −aj ∆i

Back-propagation derivation contd.

    ∂E/∂Wk,j = −Σi (yi − ai) ∂ai/∂Wk,j = −Σi (yi − ai) ∂g(ini)/∂Wk,j
             = −Σi (yi − ai) g′(ini) ∂ini/∂Wk,j = −Σi ∆i ∂/∂Wk,j (Σj Wj,i aj)
             = −Σi ∆i Wj,i ∂aj/∂Wk,j = −Σi ∆i Wj,i ∂g(inj)/∂Wk,j
             = −Σi ∆i Wj,i g′(inj) ∂inj/∂Wk,j
             = −Σi ∆i Wj,i g′(inj) ∂/∂Wk,j (Σk Wk,j ak)
             = −Σi ∆i Wj,i g′(inj) ak = −ak ∆j

Back-propagation learning contd.


At each epoch, sum gradient updates for all examples and apply

Training curve for 100 restaurant examples: finds exact fit

[Figure: total error on training set vs. number of epochs (0–400) — error falls from about 14 to 0]

Typical problems: slow convergence, local minima


Back-propagation learning contd.


Learning curve for MLP with 4 hidden units:

[Figure: proportion correct on test set vs. training set size (RESTAURANT data), comparing the decision tree and the multilayer network]



MLPs are quite good for complex pattern recognition tasks,
but resulting hypotheses cannot be understood easily


Handwritten digit recognition

3-nearest-neighbor = 2.4% error

400–300–10 unit MLP = 1.6% error

LeNet: 768–192–30–10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) ≈ 0.6% error
www.StudentRockStars.com

Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)

Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, fraud detection, etc.

Engineering, cognitive modelling, and neural system modelling
subfields have largely diverged