Module 5 - ch4 Tom Mitchell PDF
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
Connectionist Models
Consider humans:
Neuron switching time ~ 0.001 second
Number of neurons ~ 10^10
When to Consider Neural Networks
[Figure: ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units. ALVINN drives 70 mph on highways.]
Perceptron
[Figure: perceptron unit. Inputs x1 ... xn with weights w1 ... wn, plus a fixed input x0 = 1 carrying the bias weight w0, feed the weighted sum \sum_{i=0}^{n} w_i x_i and a threshold: o = 1 if the sum is positive, -1 otherwise.]

o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise.} \end{cases}
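A minimal sketch of this unit in Python (numpy and the function name are illustrative choices, not from the slides):

import numpy as np

def perceptron_output(w, x):
    # x[0] is the constant input 1, so w[0] plays the role of the bias weight w0
    return 1 if np.dot(w, x) > 0 else -1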
Decision Surface of a Perceptron
[Figure: decision surfaces in the (x1, x2) plane. (a) A set of + and - examples that a single line can separate. (b) A set of + and - examples (such as XOR) that no single line can separate.]
Perceptron training rule
w_i \leftarrow w_i + \Delta w_i

where

\Delta w_i = \eta (t - o) x_i

Where:
t = c(\vec{x}) is the target value
o is the perceptron output
\eta is a small constant (e.g., 0.1) called the learning rate
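As a sketch, one application of the rule in Python, reusing perceptron_output from the earlier snippet (eta = 0.1 mirrors the slide's example value):

def perceptron_train_step(w, x, t, eta=0.1):
    # Delta w_i = eta * (t - o) * x_i, applied to the whole weight vector at once
    o = perceptron_output(w, x)
    return w + eta * (t - o) * x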
Gradient Descent
[Figure: the error surface E[w] plotted over weights w0 and w1: a smooth bowl with a single global minimum.]
Gradient
\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \cdots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
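In code the rule is one line; here grad_E stands for any function returning \nabla E[\vec{w}] (a hypothetical placeholder, not something defined on the slides):

def gradient_descent_step(w, grad_E, eta=0.05):
    # Delta w = -eta * gradient of E evaluated at w
    return w - eta * grad_E(w)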
Gradient Descent
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2

= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2

= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)

= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d) (-x_{i,d})
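A vectorized sketch of one batch step for a linear unit using this result (X holds one example \vec{x}_d per row, T the targets t_d; names are illustrative):

def linear_unit_batch_step(w, X, T, eta=0.05):
    O = X @ w                    # outputs o_d = w . x_d for every example at once
    grad = -(T - O) @ X          # dE/dw_i = sum_d (t_d - o_d)(-x_{i,d})
    return w - eta * grad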
Summary
Incremental (Stochastic) Gradient Descent
Do gradient descent on a per-example error, updating the weights after each training example d rather than after summing over all of D:

E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2
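A sketch of one incremental step for a linear unit (a single example \vec{x}_d, t_d rather than the whole batch):

def incremental_step(w, x_d, t_d, eta=0.05):
    # Delta w_i = eta * (t_d - o_d) * x_{i,d}, computed from this one example only
    o_d = np.dot(w, x_d)         # linear unit output
    return w + eta * (t_d - o_d) * x_d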
Multilayer Networks of Sigmoid Units
[Figure: decision regions learned by a multilayer network over two inputs F1 and F2 (the textbook's vowel-recognition example).]
Sigmoid Unit
[Figure: sigmoid unit. Inputs x1 ... xn with weights w1 ... wn, plus a fixed input x0 = 1 with bias weight w0.]

net = \sum_{i=0}^{n} w_i x_i

o = \sigma(net) = \frac{1}{1 + e^{-net}}
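The unit in Python, as a sketch (sigmoid_unit is an illustrative name; numpy as before):

def sigmoid(net):
    # logistic squashing function: 1 / (1 + e^(-net))
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_unit(w, x):
    # x[0] = 1 supplies the bias weight w0, as in the figure
    return sigmoid(np.dot(w, x))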
Error Gradient for a Sigmoid Unit
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)

= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) o_d (1 - o_d) x_{i,d}
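The same result vectorized, as a sketch (rows of X are the \vec{x}_d, T holds the t_d; sigmoid as defined in the earlier snippet):

def sigmoid_unit_gradient(w, X, T):
    O = sigmoid(X @ w)                     # o_d for every training example
    # dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
    return -((T - O) * O * (1 - O)) @ X

A batch step is then w = w - eta * sigmoid_unit_gradient(w, X, T).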
Backpropagation Algorithm
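A compact sketch of one stochastic backpropagation step for a network with a single hidden layer of sigmoid units, built from the error terms derived above (array names, shapes, and eta are illustrative; each layer's bias is supplied by appending a constant 1 to its input; sigmoid as defined earlier):

def backprop_step(W_hid, W_out, x, t, eta=0.05):
    # forward pass
    x_b = np.append(x, 1.0)                  # the 1 makes the last column of W_hid a bias
    h = sigmoid(W_hid @ x_b)                 # hidden-unit outputs
    h_b = np.append(h, 1.0)
    o = sigmoid(W_out @ h_b)                 # output-unit outputs
    # error term for each output unit k: delta_k = o_k (1 - o_k) (t_k - o_k)
    delta_o = o * (1 - o) * (t - o)
    # error term for each hidden unit: delta_h = h (1 - h) * sum_k w_kh delta_k
    delta_h = h * (1 - h) * (W_out[:, :-1].T @ delta_o)
    # weight updates: w_ji <- w_ji + eta * delta_j * input_i
    W_out = W_out + eta * np.outer(delta_o, h_b)
    W_hid = W_hid + eta * np.outer(delta_h, x_b)
    return W_hid, W_out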
More on Backpropagation
Learning Hidden Layer Representations
[Figure: a network with eight inputs and eight outputs connected through three hidden units (an 8 x 3 x 8 network).]

A target function:

Input      Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Can this be learned??
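It can. As a hedged check, training an 8 x 3 x 8 network on these patterns with the backprop_step sketch above (the seed, sweep count, and eta are illustrative choices):

rng = np.random.default_rng(0)
X = np.eye(8)                                # the eight one-hot patterns; target = input
W_hid = rng.uniform(-0.1, 0.1, (3, 9))       # 3 hidden units, 8 inputs + bias
W_out = rng.uniform(-0.1, 0.1, (8, 4))       # 8 outputs, 3 hidden units + bias
for sweep in range(5000):
    for d in range(8):
        W_hid, W_out = backprop_step(W_hid, W_out, X[d], X[d], eta=0.3)
# the three hidden activations per input tend toward a distinct binary-like code

Inspecting sigmoid(W_hid @ np.append(X[d], 1.0)) for each input d then shows the encoding the hidden layer has invented.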
Training
Convergence of Backpropagation
Expressive Capabilities of ANNs
Boolean functions:
Every boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Overfitting in ANNs
[Figure: two plots of error versus number of weight updates. In each, training-set error falls steadily while validation-set error eventually begins to rise, signaling overfitting.]
Neural Nets for Face Recognition
[Figure: example face images and the network's four head-pose outputs: left, straight, right, up.]
Learned Hidden Unit Weights
[Figure: the learned hidden-unit weights visualized as images, shown with the four outputs left, straight, right, up.]
http://www.cs.cmu.edu/~tom/faces.html
Alternative Error Functions
Recurrent Networks
[Figure: (a) a feedforward network computing y(t + 1) from x(t); (b) a recurrent network that feeds context units c(t) back as inputs alongside x(t); (c) the same recurrent network unfolded in time over x(t), x(t - 1), x(t - 2).]