Fall 2005
Ahmed Elgammal
Dept of Computer Science
Rutgers University
Neural Networks
Biological Motivation: Brain
• Networks of processing units (neurons) with connections (synapses)
between them
• Large number of neurons: ~10^11
• Large connectivity: each neuron is connected, on average, to 10^4 others
• Switching time ~10^-3 second
• Parallel processing
• Distributed computation/memory
• Processing is done by neurons
and the memory is in the synapses
• Robust to noise, failures
Neural Networks
Characteristics of Biological Computation
• Massive Parallelism
• Locality of Computation
• Adaptive (Self Organizing)
• Representation is Distributed
• ALVINN system (Pomerleau, 1993)
• Many successful examples:
– Speech phoneme recognition [Waibel]
– Image classification [Kanade, Baluja, Rowley]
– Financial prediction
– Backgammon [Tesauro]
Perceptron
y = \sum_{j=1}^{d} w_j x_j + w_0 = w^T x
w = [w_0, w_1, \ldots, w_d]^T
x = [1, x_1, \ldots, x_d]^T
(Rosenblatt, 1962)
Perceptron
Activation Rule:
Linear threshold (step unit): output 1 if w^T x > 0, and 0 otherwise
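To make the linear-threshold unit concrete, here is a minimal sketch (not from the slides) of a perceptron's output computation; the weight values are arbitrary illustrations.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear threshold unit: y = 1 if w^T x > 0, else 0.
    w includes the bias w_0; x is augmented with x_0 = +1."""
    return 1 if np.dot(w, x) > 0 else 0

# Example: d = 2 inputs, augmented with the bias input x_0 = +1
w = np.array([-0.5, 1.0, 1.0])   # [w_0, w_1, w_2] (illustrative values)
x = np.array([1.0, 0.0, 1.0])    # [x_0 = 1, x_1, x_2]
print(perceptron_output(w, x))   # prints 1, since -0.5 + 0 + 1 > 0
```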
What a Perceptron Does
• 1-dimensional case:
• Regression: y = wx + w_0
• Classification: y = 1(wx + w_0 > 0)
[Figure: the regression line y = wx + w_0 and the thresholded classifier output, with bias input x_0 = +1]
Perceptron Decision Surface
Perceptron training rule
\Delta w_i = \eta (t - o)\, x_i, where t is the target output and o the current perceptron output.
Can prove it will converge
• If training data is linearly separable
• and η sufficiently small
• Perceptron Convergence Theorem (Rosenblatt): if the data are linearly separable, then the perceptron learning algorithm converges in finite time.
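A minimal sketch of the perceptron training rule \Delta w_i = \eta (t - o) x_i; the data set (the AND function) and the learning rate are illustrative choices, not from the slides.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    X is augmented with a leading column of 1s for the bias weight w_0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else 0   # step unit
            if o != target:
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:                        # converged on separable data
            break
    return w

# Linearly separable example: the AND function
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
print(train_perceptron(X, t))
```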
Error Surface
Gradient Descent
For a linear unit, define the error over training set D as E[w] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2.
Gradient
∇E[w] = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.
Training rule:
∆w = -η ∇E[w]
in other words:
∆wi = -η ∂E/∂wi
This results in the update rule derived on the next slide.
Gradient of Error
∂E/∂wi = Σd (td - od)(-xi,d)
Learning Rule:
∆wi = -η ∂E/∂wi = η Σd (td - od) xi,d
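A sketch of batch gradient descent for a linear unit with the rule above (the delta rule); the synthetic data and learning rate are placeholders, not from the slides.

```python
import numpy as np

def delta_rule(X, t, eta=0.01, epochs=500):
    """Batch gradient descent on E[w] = 1/2 * sum_d (t_d - o_d)^2
    for a linear unit o = w^T x (X already augmented with x_0 = 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                       # outputs for all training examples
        w += eta * X.T @ (t - o)        # delta_w_i = eta * sum_d (t_d - o_d) x_{i,d}
    return w

# Fit a noisy line t = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
t = 2 * x + 1 + 0.1 * rng.standard_normal(50)
print(delta_rule(X, t))                 # approximately [1, 2]
```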
More Stochastic Grad. Desc.
Summary
Perceptron training rule will succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training uses gradient descent (delta rule)
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contain noise
• Even when training data are not separable by H
K Outputs
Regression:
y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x
Classification:
o_i = w_i^T x
y_i = \frac{\exp o_i}{\sum_k \exp o_k}
Choose C_i if y_i = \max_k y_k
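A sketch of the K-output linear classifier above with softmax outputs; the weight matrix values and dimensions are illustrative.

```python
import numpy as np

def softmax_classify(W, x):
    """K linear outputs o_i = w_i^T x, normalized with softmax;
    returns the class probabilities y and the chosen class index."""
    o = W @ x                              # one row of W per class, x_0 = 1 for bias
    o = o - o.max()                        # subtract max for numerical stability
    y = np.exp(o) / np.exp(o).sum()        # y_i = exp(o_i) / sum_k exp(o_k)
    return y, int(np.argmax(y))            # choose C_i if y_i = max_k y_k

# Illustrative 3-class example with d = 2 inputs (plus bias)
W = np.array([[ 0.1,  1.0, -0.5],
              [-0.3, -1.0,  0.8],
              [ 0.0,  0.2,  0.2]])
x = np.array([1.0, 0.5, -0.2])             # [x_0 = 1, x_1, x_2]
y, c = softmax_classify(W, x)
print(y, "-> class", c)
```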
Training
• Online (instances seen one by one) vs batch (whole sample) learning:
– No need to store the whole sample
– Problem may change in time
– Wear and degradation in system components
• Stochastic gradient-descent: Update after a single pattern
• Generic update rule (LMS rule):
\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t
Update = LearningFactor × (DesiredOutput − ActualOutput) × Input
E^t\!\left(w \mid x^t, r^t\right) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left[r^t - w^T x^t\right]^2
\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t
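A sketch of the stochastic (per-pattern) LMS update above for a single linear output; the streaming data are a synthetic placeholder.

```python
import numpy as np

def lms_online(stream, d, eta=0.05):
    """Stochastic gradient descent with the LMS rule:
    w_j <- w_j + eta * (r - y) * x_j after each pattern (x, r)."""
    w = np.zeros(d + 1)                    # includes bias weight w_0
    for x, r in stream:
        x = np.concatenate(([1.0], x))     # augment with x_0 = 1
        y = w @ x                          # actual output
        w += eta * (r - y) * x             # eta * (desired - actual) * input
    return w

# Synthetic stream: r = 3*x_1 - 2 plus noise, seen one instance at a time
rng = np.random.default_rng(1)
stream = []
for _ in range(2000):
    xi = rng.uniform(-1, 1, 1)
    stream.append((xi, 3 * xi[0] - 2 + 0.05 * rng.standard_normal()))
print(lms_online(stream, d=1))             # approximately [-2, 3]
```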
Classification
• Single sigmoid output:
y^t = \text{sigmoid}\left(w^T x^t\right)
• Softmax outputs (K classes):
y_i^t = \frac{\exp\left(w_i^T x^t\right)}{\sum_k \exp\left(w_k^T x^t\right)}
Cross-entropy error:
E^t\!\left(\{w_i\}_i \mid x^t, r^t\right) = -\sum_i r_i^t \log y_i^t
Update rule (same form as LMS):
\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t
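A sketch of the single-sigmoid-output classifier trained with the per-pattern update above; the toy data (OR-like labels), learning rate, and epoch count are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.5, epochs=200):
    """Single sigmoid output y = sigmoid(w^T x), trained with the
    cross-entropy gradient; the update has the same LMS-like form:
    delta_w_j = eta * (r - y) * x_j. X is augmented with x_0 = 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, r):
            y = sigmoid(w @ x)
            w += eta * (target - y) * x
    return w

# Toy two-class problem (OR-like labels), x_0 = 1 prepended
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
r = np.array([0, 1, 1, 1])
w = train_logistic(X, r)
print(sigmoid(X @ w))        # probabilities near the labels 0, 1, 1, 1
```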
XOR
A single perceptron cannot implement XOR: no weights w_0, w_1, w_2 satisfy all four constraints (Minsky and Papert, 1969)
w_0 ≤ 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 ≤ 0
Multilayer Perceptrons
y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = \text{sigmoid}\left(w_h^T x\right) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
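As a concrete illustration of this decomposition, here is a sketch of a two-layer network of threshold units with hand-picked (not learned) weights that computes XOR; the specific weight values are one possible choice.

```python
import numpy as np

def step(a):
    return (a > 0).astype(float)

def xor_mlp(x1, x2):
    """Hidden unit z1 computes (x1 AND NOT x2), z2 computes (NOT x1 AND x2);
    the output unit computes (z1 OR z2). Weights chosen by hand."""
    x = np.array([1.0, x1, x2])                    # bias input x_0 = +1
    W_hidden = np.array([[-0.5, 1.0, -1.0],        # z1: x1 - x2 - 0.5 > 0
                         [-0.5, -1.0, 1.0]])       # z2: x2 - x1 - 0.5 > 0
    z = step(W_hidden @ x)
    v = np.array([-0.5, 1.0, 1.0])                 # output: z1 + z2 - 0.5 > 0
    return step(v @ np.concatenate(([1.0], z)))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_mlp(a, b)))                # prints 0, 1, 1, 0
```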
Sigmoid Activation
Error Gradient for a Sigmoid Unit
∂E/∂wi = Σd (td - od) (-∂od/∂wi)
Even more…
But we know:
∂od/∂netd = ∂σ(netd)/∂netd = od (1 - od)
∂netd/∂wi = ∂(w · xd)/∂wi = xi,d
So:
∂E/∂wi = -Σd (td - od) od (1 - od) xi,d
Backpropagation
y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = \text{sigmoid}\left(w_h^T x\right) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}
\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
• For each training example, Do
1. Input the training example to the network and compute the network
outputs
2. For each output unit k:
δk = ok(1 - ok)(tk - ok)
3. For each hidden unit h:
δh = oh(1 - oh) Σk∈outputs wkh δk
4. Update each network weight wij:
wij ← wij + η δj xij, where xij is the input from unit i into unit j
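A minimal sketch of this algorithm for one hidden layer of sigmoid units with stochastic (per-example) updates; the network size, learning rate, epoch count, and XOR training data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(X, T, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for a 1-hidden-layer sigmoid network.
    X: inputs (already augmented with x_0 = 1), T: target outputs in [0, 1]."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))      # input -> hidden
    V = rng.uniform(-0.5, 0.5, (T.shape[1], n_hidden + 1))  # hidden (+bias) -> output
    for _ in range(epochs):
        for x, t in zip(X, T):
            # 1. forward pass
            z = sigmoid(W @ x)
            z1 = np.concatenate(([1.0], z))                  # hidden bias unit
            o = sigmoid(V @ z1)
            # 2. output deltas: delta_k = o_k (1 - o_k)(t_k - o_k)
            delta_o = o * (1 - o) * (t - o)
            # 3. hidden deltas: delta_h = z_h (1 - z_h) sum_k v_kh delta_k
            delta_h = z * (1 - z) * (V[:, 1:].T @ delta_o)
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            V += eta * np.outer(delta_o, z1)
            W += eta * np.outer(delta_h, x)
    return W, V

# Train on XOR (inputs augmented with x_0 = 1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W, V = backprop(X, T)
z1 = np.hstack([np.ones((4, 1)), sigmoid(X @ W.T)])
print(sigmoid(z1 @ V.T).round(2))    # outputs after training, typically near [[0],[1],[1],[0]]
```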
Regression
E(W, v \mid X) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2
y^t = \sum_{h=1}^{H} v_h z_h^t + v_0
Forward: z_h^t = \text{sigmoid}\left(w_h^T x^t\right)
Backward:
\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t
\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}}
= -\eta \sum_t -\left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t
= \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t
Regression with Multiple Outputs
E(W, V \mid X) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2
y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}
\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t
\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t
[Network diagram: inputs x_j feed hidden units z_h through weights w_{hj}; hidden units feed outputs y_i through weights v_{ih}]
[Figure: hidden-unit responses, showing w_h^T x + w_{h0}, the sigmoid outputs z_h, and the weighted contributions v_h z_h]
Two-Class Discrimination
• One sigmoid output y^t for P(C_1 \mid x^t), with P(C_2 \mid x^t) \equiv 1 - y^t
y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)
E(W, v \mid X) = -\sum_t \left[ r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right) \right]
\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t
\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t
K > 2 Classes
o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0} \qquad y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} \equiv P\left(C_i \mid x^t\right)
\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t
\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t
Multiple Hidden Layers
• MLP with one hidden layer is a universal approximator (Hornik et al.,
1989), but using multiple layers may lead to simpler networks
z_{1h} = \text{sigmoid}\left(w_{1h}^T x\right) = \text{sigmoid}\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1
z_{2l} = \text{sigmoid}\left(w_{2l}^T z_1\right) = \text{sigmoid}\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2
y = v^T z_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0
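A sketch of the forward pass through the two hidden layers defined above; the layer sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_two_hidden(x, W1, W2, v):
    """Forward pass of an MLP with two hidden layers.
    Each weight matrix's first column multiplies the constant bias input 1."""
    z1 = sigmoid(W1 @ np.concatenate(([1.0], x)))     # z_1h, h = 1..H1
    z2 = sigmoid(W2 @ np.concatenate(([1.0], z1)))    # z_2l, l = 1..H2
    y = v @ np.concatenate(([1.0], z2))               # y = v^T z_2
    return y

# Illustrative dimensions: d = 4 inputs, H1 = 5, H2 = 3, one output
rng = np.random.default_rng(0)
d, H1, H2 = 4, 5, 3
W1 = rng.standard_normal((H1, d + 1)) * 0.1
W2 = rng.standard_normal((H2, H1 + 1)) * 0.1
v = rng.standard_normal(H2 + 1) * 0.1
print(forward_two_hidden(rng.standard_normal(d), W1, W2, v))
```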
Improving Convergence
• Momentum:
\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}
• Adaptive learning rate:
\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}
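A sketch of the momentum update applied to a simple quadratic objective; the objective, α, and η are illustrative. (The adaptive-learning-rate rule would additionally adjust η by +a or −bη depending on whether the error decreased.)

```python
import numpy as np

def gd_momentum(grad, w0, eta=0.1, alpha=0.9, steps=200):
    """Gradient descent with momentum:
    delta_w^t = -eta * dE/dw + alpha * delta_w^(t-1)."""
    w = np.array(w0, dtype=float)
    delta = np.zeros_like(w)
    for _ in range(steps):
        delta = -eta * grad(w) + alpha * delta
        w += delta
    return w

# Illustrative objective: E(w) = 0.5*(w1 - 3)^2 + 5*(w2 + 1)^2
grad = lambda w: np.array([w[0] - 3.0, 10.0 * (w[1] + 1.0)])
print(gd_momentum(grad, w0=[0.0, 0.0]))   # approaches [3, -1]
```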
Overfitting/Overtraining
Structured MLP
Weight Sharing
Convolutional neural networks
• Also known as gradient-based learning
• Template matching using NN classifiers seems to work
• Natural features are filter outputs
– probably, spots and bars, as in texture
– but why not learn the filter kernels, too?
• A perceptron approximates convolution.
• Network architecture: two types of layers (a minimal sketch follows this list)
– Convolution layers: convolve the image with filter kernels to obtain filter maps
– Subsampling layers: reduce the resolution of the filter maps
– The number of filter maps increases as the resolution decreases
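A rough sketch (using NumPy, not the original LeNet code) of the two layer types described above: a convolution layer producing one filter map per kernel, followed by a 2×2 subsampling layer; the kernels and image are illustrative.

```python
import numpy as np

def conv_layer(image, kernels):
    """Convolution layer: one 'valid' correlation map per filter kernel."""
    H, W = image.shape
    maps = []
    for k in kernels:
        kh, kw = k.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
        maps.append(out)
    return maps

def subsample(fmap, s=2):
    """Subsampling layer: average over non-overlapping s x s blocks."""
    H, W = (fmap.shape[0] // s) * s, (fmap.shape[1] // s) * s
    return fmap[:H, :W].reshape(H // s, s, W // s, s).mean(axis=(1, 3))

# Illustrative 'spot' and 'bar' kernels applied to a random image
rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernels = [np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float),   # spot detector
           np.array([[-1, 2, -1]] * 3, float)]                    # vertical bar
for m in conv_layer(image, kernels):
    print(subsample(m).shape)        # each 6x6 map reduced to 3x3
```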
LeNet is used to classify handwritten digits. Notice that the
test error rate is not the same as the training error rate, because
the test set consists of items not in the training set. Not all
classification schemes necessarily have small test error when they have small training error. LeNet achieves an error rate of 0.95% on the MNIST database.
Weight decay:
\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i
E' = E + \frac{\lambda}{2} \sum_i w_i^2
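A sketch of weight decay added to the earlier delta-rule fit, i.e. gradient descent on E' = E + (λ/2) Σ_i w_i^2; λ, the learning rate, and the data are illustrative placeholders.

```python
import numpy as np

def delta_rule_weight_decay(X, t, eta=0.01, lam=0.1, epochs=500):
    """Gradient descent on E' = E + (lambda/2) * sum_i w_i^2, i.e. the
    update delta_w_i = -eta * (dE/dw_i + lambda * w_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w
        w += eta * (X.T @ (t - o) - lam * w)   # error gradient plus decay term
    return w

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
t = 2 * x + 1 + 0.1 * rng.standard_normal(50)
print(delta_rule_weight_decay(X, t))   # approximately [1, 2], slightly shrunk toward 0
```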
Bayesian Learning
• Consider the weights w_i as random variables with prior p(w_i)
p(w \mid X) = \frac{p(X \mid w)\, p(w)}{p(X)} \qquad \hat{w}_{MAP} = \arg\max_w \log p(w \mid X)
\log p(w \mid X) = \log p(X \mid w) + \log p(w) + C
p(w) = \prod_i p(w_i) \quad \text{where} \quad p(w_i) = c \cdot \exp\left[-\frac{w_i^2}{2(1/2\lambda)}\right]
E' = E + \lambda \|w\|^2
With this Gaussian prior, maximizing the posterior is equivalent to weight decay.
Dimensionality Reduction
Learning Time
• Applications:
– Sequence recognition: Speech recognition
– Sequence reproduction: Time-series prediction
– Sequence association
• Network architectures
– Time-delay networks (Waibel et al., 1989)
– Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time
Architecture of the complete system: they use another neural
net to estimate orientation of the face, then rectify it. They
search over scales to find bigger/smaller faces.
Figure from “Rotation invariant neural-network based face detection,” H.A. Rowley, S.
Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright
1998, IEEE
Sources
• Slides by Ethem Alpaydin, Introduction to Machine Learning, © The MIT Press, 2004
• Slides by Tom M. Mitchell
• Ethem Alpaydin, Introduction to Machine Learning, Chapter 10
• Tom M. Mitchell, Machine Learning