Fall 2005
Ahmed Elgammal
Dept of Computer Science
Rutgers University
Neural Networks
Biological Motivation: Brain
• Networks of processing units (neurons) with connections (synapses)
between them
• Large number of neurons: ~10^11
• Large connectivity: each neuron is connected, on average, to 10^4 others
• Switching time ~10^-3 second
• Parallel processing
• Distributed computation/memory
• Processing is done by neurons
and the memory is in the synapses
• Robust to noise, failures
Neural Networks
Characteristics of Biological Computation
• Massive Parallelism
• Locality of Computation
• Adaptive (Self Organizing)
• Representation is Distributed
• ALVINN system (Pomerleau, 1993)
• Many successful examples:
– Speech phoneme recognition [Waibel]
– Image classification [Kanade, Baluja, Rowley]
– Financial prediction
– Backgammon [Tesauro]
Perceptron
y = \sum_{j=1}^{d} w_j x_j + w_0 = w^T x
w = [w_0, w_1, \ldots, w_d]^T
x = [1, x_1, \ldots, x_d]^T
(Rosenblatt, 1962)
Perceptron
Activation Rule:
Linear threshold (step unit): output 1 if w^T x > 0, and 0 otherwise
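To make the linear-threshold unit concrete, here is a minimal sketch (not from the slides) of a perceptron's output computation; the weight values are arbitrary illustrations.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear threshold unit: y = 1 if w^T x > 0, else 0.
    w includes the bias w_0; x is augmented with x_0 = +1."""
    return 1 if np.dot(w, x) > 0 else 0

# Example: d = 2 inputs, augmented with the bias input x_0 = +1
w = np.array([-0.5, 1.0, 1.0])   # [w_0, w_1, w_2] (illustrative values)
x = np.array([1.0, 0.0, 1.0])    # [x_0 = 1, x_1, x_2]
print(perceptron_output(w, x))   # prints 1, since -0.5 + 0 + 1 > 0
```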
What a Perceptron Does
• 1-dimensional case:
• Regression: y = wx + w_0
• Classification: y = 1(wx + w_0 > 0)
[Figure: the regression line y = wx + w_0 and the thresholded classifier output, with bias input x_0 = +1]
Perceptron Decision Surface
Perceptron training rule
\Delta w_i = \eta (t - o)\, x_i, where t is the target output and o the current perceptron output.
Can prove it will converge
• If training data is linearly separable
• and η sufficiently small
• Perceptron Convergence Theorem (Rosenblatt): if the data are linearly separable, then the perceptron learning algorithm converges in finite time.
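A minimal sketch of the perceptron training rule \Delta w_i = \eta (t - o) x_i; the data set (the AND function) and the learning rate are illustrative choices, not from the slides.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    X is augmented with a leading column of 1s for the bias weight w_0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else 0   # step unit
            if o != target:
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:                        # converged on separable data
            break
    return w

# Linearly separable example: the AND function
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
print(train_perceptron(X, t))
```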
Error Surface
Gradient Descent
For a linear unit, define the error over training set D as E[w] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2.
Gradient
∇E[w] = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.
Training rule:
∆w = -η ∇E[w]
in other words:
∆wi = -η ∂E/∂wi
This results in the update rule derived on the next slide.
Gradient of Error
∂E/∂wi = Σd (td - od)(-xi,d)
Learning Rule:
∆wi = -η ∂E/∂wi = η Σd (td - od) xi,d
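A sketch of batch gradient descent for a linear unit with the rule above (the delta rule); the synthetic data and learning rate are placeholders, not from the slides.

```python
import numpy as np

def delta_rule(X, t, eta=0.01, epochs=500):
    """Batch gradient descent on E[w] = 1/2 * sum_d (t_d - o_d)^2
    for a linear unit o = w^T x (X already augmented with x_0 = 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                       # outputs for all training examples
        w += eta * X.T @ (t - o)        # delta_w_i = eta * sum_d (t_d - o_d) x_{i,d}
    return w

# Fit a noisy line t = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
t = 2 * x + 1 + 0.1 * rng.standard_normal(50)
print(delta_rule(X, t))                 # approximately [1, 2]
```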
More Stochastic Grad. Desc.
Summary
Perceptron training rule will succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training uses gradient descent (delta rule)
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contain noise
• Even when training data are not separable by H
K Outputs
Regression:
y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x
Classification:
o_i = w_i^T x
y_i = \frac{\exp o_i}{\sum_k \exp o_k}
Choose C_i if y_i = \max_k y_k
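A sketch of the K-output linear classifier above with softmax outputs; the weight matrix values and dimensions are illustrative.

```python
import numpy as np

def softmax_classify(W, x):
    """K linear outputs o_i = w_i^T x, normalized with softmax;
    returns the class probabilities y and the chosen class index."""
    o = W @ x                              # one row of W per class, x_0 = 1 for bias
    o = o - o.max()                        # subtract max for numerical stability
    y = np.exp(o) / np.exp(o).sum()        # y_i = exp(o_i) / sum_k exp(o_k)
    return y, int(np.argmax(y))            # choose C_i if y_i = max_k y_k

# Illustrative 3-class example with d = 2 inputs (plus bias)
W = np.array([[ 0.1,  1.0, -0.5],
              [-0.3, -1.0,  0.8],
              [ 0.0,  0.2,  0.2]])
x = np.array([1.0, 0.5, -0.2])             # [x_0 = 1, x_1, x_2]
y, c = softmax_classify(W, x)
print(y, "-> class", c)
```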
Training
• Online (instances seen one by one) vs batch (whole sample) learning:
– No need to store the whole sample
– Problem may change in time
– Wear and degradation in system components
• Stochastic gradient-descent: Update after a single pattern
• Generic update rule (LMS rule):
\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t
Update = LearningFactor × (DesiredOutput − ActualOutput) × Input
E^t\!\left(w \mid x^t, r^t\right) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left[r^t - w^T x^t\right]^2
\Delta w_j^t = \eta \left(r^t - y^t\right) x_j^t
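A sketch of the stochastic (per-pattern) LMS update above for a single linear output; the streaming data are a synthetic placeholder.

```python
import numpy as np

def lms_online(stream, d, eta=0.05):
    """Stochastic gradient descent with the LMS rule:
    w_j <- w_j + eta * (r - y) * x_j after each pattern (x, r)."""
    w = np.zeros(d + 1)                    # includes bias weight w_0
    for x, r in stream:
        x = np.concatenate(([1.0], x))     # augment with x_0 = 1
        y = w @ x                          # actual output
        w += eta * (r - y) * x             # eta * (desired - actual) * input
    return w

# Synthetic stream: r = 3*x_1 - 2 plus noise, seen one instance at a time
rng = np.random.default_rng(1)
stream = []
for _ in range(2000):
    xi = rng.uniform(-1, 1, 1)
    stream.append((xi, 3 * xi[0] - 2 + 0.05 * rng.standard_normal()))
print(lms_online(stream, d=1))             # approximately [-2, 3]
```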
Classification
• Single sigmoid output:
y^t = \text{sigmoid}\left(w^T x^t\right)
• Softmax outputs (K classes):
y_i^t = \frac{\exp\left(w_i^T x^t\right)}{\sum_k \exp\left(w_k^T x^t\right)}
Cross-entropy error:
E^t\!\left(\{w_i\}_i \mid x^t, r^t\right) = -\sum_i r_i^t \log y_i^t
Update rule (same form as LMS):
\Delta w_{ij}^t = \eta \left(r_i^t - y_i^t\right) x_j^t
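A sketch of the single-sigmoid-output classifier trained with the per-pattern update above; the toy data (OR-like labels), learning rate, and epoch count are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.5, epochs=200):
    """Single sigmoid output y = sigmoid(w^T x), trained with the
    cross-entropy gradient; the update has the same LMS-like form:
    delta_w_j = eta * (r - y) * x_j. X is augmented with x_0 = 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, r):
            y = sigmoid(w @ x)
            w += eta * (target - y) * x
    return w

# Toy two-class problem (OR-like labels), x_0 = 1 prepended
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
r = np.array([0, 1, 1, 1])
w = train_logistic(X, r)
print(sigmoid(X @ w))        # probabilities near the labels 0, 1, 1, 1
```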
XOR
A single perceptron cannot implement XOR: no weights w_0, w_1, w_2 satisfy all four constraints (Minsky and Papert, 1969)
w_0 ≤ 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 ≤ 0
Multilayer Perceptrons
y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = \text{sigmoid}\left(w_h^T x\right) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
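As a concrete illustration of this decomposition, here is a sketch of a two-layer network of threshold units with hand-picked (not learned) weights that computes XOR; the specific weight values are one possible choice.

```python
import numpy as np

def step(a):
    return (a > 0).astype(float)

def xor_mlp(x1, x2):
    """Hidden unit z1 computes (x1 AND NOT x2), z2 computes (NOT x1 AND x2);
    the output unit computes (z1 OR z2). Weights chosen by hand."""
    x = np.array([1.0, x1, x2])                    # bias input x_0 = +1
    W_hidden = np.array([[-0.5, 1.0, -1.0],        # z1: x1 - x2 - 0.5 > 0
                         [-0.5, -1.0, 1.0]])       # z2: x2 - x1 - 0.5 > 0
    z = step(W_hidden @ x)
    v = np.array([-0.5, 1.0, 1.0])                 # output: z1 + z2 - 0.5 > 0
    return step(v @ np.concatenate(([1.0], z)))

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_mlp(a, b)))                # prints 0, 1, 1, 0
```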
Sigmoid Activation
Error Gradient for a Sigmoid Unit
∂E/∂wi = Σd (td - od) (-∂od/∂wi)
Even more…
But we know:
∂od/∂netd = ∂σ(netd)/∂netd = od (1 - od)
∂netd/∂wi = ∂(w · xd)/∂wi = xi,d
So:
∂E/∂wi = -Σd (td - od) od (1 - od) xi,d
Backpropagation
y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = \text{sigmoid}\left(w_h^T x\right) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}
\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
• For each training example, Do
1. Input the training example to the network and compute the network
outputs
2. For each output unit k:
δk = ok(1 - ok)(tk - ok)
3. For each hidden unit h:
δh = oh(1 - oh) Σk∈outputs wkh δk
4. Update each network weight wij:
wij ← wij + η δj xij, where xij is the input from unit i into unit j
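A minimal sketch of this algorithm for one hidden layer of sigmoid units with stochastic (per-example) updates; the network size, learning rate, epoch count, and XOR training data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(X, T, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for a 1-hidden-layer sigmoid network.
    X: inputs (already augmented with x_0 = 1), T: target outputs in [0, 1]."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1]))      # input -> hidden
    V = rng.uniform(-0.5, 0.5, (T.shape[1], n_hidden + 1))  # hidden (+bias) -> output
    for _ in range(epochs):
        for x, t in zip(X, T):
            # 1. forward pass
            z = sigmoid(W @ x)
            z1 = np.concatenate(([1.0], z))                  # hidden bias unit
            o = sigmoid(V @ z1)
            # 2. output deltas: delta_k = o_k (1 - o_k)(t_k - o_k)
            delta_o = o * (1 - o) * (t - o)
            # 3. hidden deltas: delta_h = z_h (1 - z_h) sum_k v_kh delta_k
            delta_h = z * (1 - z) * (V[:, 1:].T @ delta_o)
            # 4. weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            V += eta * np.outer(delta_o, z1)
            W += eta * np.outer(delta_h, x)
    return W, V

# Train on XOR (inputs augmented with x_0 = 1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W, V = backprop(X, T)
z1 = np.hstack([np.ones((4, 1)), sigmoid(X @ W.T)])
print(sigmoid(z1 @ V.T).round(2))    # outputs after training, typically near [[0],[1],[1],[0]]
```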
Regression
E(W, v \mid X) = \frac{1}{2} \sum_t \left(r^t - y^t\right)^2
y^t = \sum_{h=1}^{H} v_h z_h^t + v_0
Forward: z_h^t = \text{sigmoid}\left(w_h^T x^t\right)
Backward:
\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t
\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}}
= -\eta \sum_t -\left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t
= \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t
Regression with Multiple Outputs
E(W, V \mid X) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2
y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}
\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t
\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t
[Network diagram: inputs x_j feed hidden units z_h through weights w_{hj}; hidden units feed outputs y_i through weights v_{ih}]
[Figure: hidden-unit responses, showing w_h^T x + w_{h0}, the sigmoid outputs z_h, and the weighted contributions v_h z_h]
Two-Class Discrimination
• One sigmoid output y^t for P(C_1 \mid x^t), with P(C_2 \mid x^t) \equiv 1 - y^t
y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)
E(W, v \mid X) = -\sum_t \left[ r^t \log y^t + \left(1 - r^t\right) \log\left(1 - y^t\right) \right]
\Delta v_h = \eta \sum_t \left(r^t - y^t\right) z_h^t
\Delta w_{hj} = \eta \sum_t \left(r^t - y^t\right) v_h\, z_h^t \left(1 - z_h^t\right) x_j^t
K > 2 Classes
o_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0} \qquad y_i^t = \frac{\exp o_i^t}{\sum_k \exp o_k^t} \equiv P\left(C_i \mid x^t\right)
\Delta v_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) z_h^t
\Delta w_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) v_{ih}\right] z_h^t \left(1 - z_h^t\right) x_j^t
Multiple Hidden Layers
• MLP with one hidden layer is a universal approximator (Hornik et al.,
1989), but using multiple layers may lead to simpler networks
z_{1h} = \text{sigmoid}\left(w_{1h}^T x\right) = \text{sigmoid}\left(\sum_{j=1}^{d} w_{1hj} x_j + w_{1h0}\right), \quad h = 1, \ldots, H_1
z_{2l} = \text{sigmoid}\left(w_{2l}^T z_1\right) = \text{sigmoid}\left(\sum_{h=1}^{H_1} w_{2lh} z_{1h} + w_{2l0}\right), \quad l = 1, \ldots, H_2
y = v^T z_2 = \sum_{l=1}^{H_2} v_l z_{2l} + v_0
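A sketch of the forward pass through the two hidden layers defined above; the layer sizes and random weights are placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_two_hidden(x, W1, W2, v):
    """Forward pass of an MLP with two hidden layers.
    Each weight matrix's first column multiplies the constant bias input 1."""
    z1 = sigmoid(W1 @ np.concatenate(([1.0], x)))     # z_1h, h = 1..H1
    z2 = sigmoid(W2 @ np.concatenate(([1.0], z1)))    # z_2l, l = 1..H2
    y = v @ np.concatenate(([1.0], z2))               # y = v^T z_2
    return y

# Illustrative dimensions: d = 4 inputs, H1 = 5, H2 = 3, one output
rng = np.random.default_rng(0)
d, H1, H2 = 4, 5, 3
W1 = rng.standard_normal((H1, d + 1)) * 0.1
W2 = rng.standard_normal((H2, H1 + 1)) * 0.1
v = rng.standard_normal(H2 + 1) * 0.1
print(forward_two_hidden(rng.standard_normal(d), W1, W2, v))
```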
Improving Convergence
• Momentum:
\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}
• Adaptive learning rate:
\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}
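A sketch of the momentum update applied to a simple quadratic objective; the objective, α, and η are illustrative. (The adaptive-learning-rate rule would additionally adjust η by +a or −bη depending on whether the error decreased.)

```python
import numpy as np

def gd_momentum(grad, w0, eta=0.1, alpha=0.9, steps=200):
    """Gradient descent with momentum:
    delta_w^t = -eta * dE/dw + alpha * delta_w^(t-1)."""
    w = np.array(w0, dtype=float)
    delta = np.zeros_like(w)
    for _ in range(steps):
        delta = -eta * grad(w) + alpha * delta
        w += delta
    return w

# Illustrative objective: E(w) = 0.5*(w1 - 3)^2 + 5*(w2 + 1)^2
grad = lambda w: np.array([w[0] - 3.0, 10.0 * (w[1] + 1.0)])
print(gd_momentum(grad, w0=[0.0, 0.0]))   # approaches [3, -1]
```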
Overfitting/Overtraining
Structured MLP
Weight Sharing
Convolutional neural networks
• Also known as gradient-based learning
• Template matching using NN classifiers seems to work
• Natural features are filter outputs
– probably, spots and bars, as in texture
– but why not learn the filter kernels, too?
• A perceptron approximates convolution.
• Network architecture: two types of layers (a minimal sketch follows this list)
– Convolution layers: convolve the image with filter kernels to obtain filter maps
– Subsampling layers: reduce the resolution of the filter maps
– The number of filter maps increases as the resolution decreases
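A rough sketch (using NumPy, not the original LeNet code) of the two layer types described above: a convolution layer producing one filter map per kernel, followed by a 2×2 subsampling layer; the kernels and image are illustrative.

```python
import numpy as np

def conv_layer(image, kernels):
    """Convolution layer: one 'valid' correlation map per filter kernel."""
    H, W = image.shape
    maps = []
    for k in kernels:
        kh, kw = k.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
        maps.append(out)
    return maps

def subsample(fmap, s=2):
    """Subsampling layer: average over non-overlapping s x s blocks."""
    H, W = (fmap.shape[0] // s) * s, (fmap.shape[1] // s) * s
    return fmap[:H, :W].reshape(H // s, s, W // s, s).mean(axis=(1, 3))

# Illustrative 'spot' and 'bar' kernels applied to a random image
rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernels = [np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float),   # spot detector
           np.array([[-1, 2, -1]] * 3, float)]                    # vertical bar
for m in conv_layer(image, kernels):
    print(subsample(m).shape)        # each 6x6 map reduced to 3x3
```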
LeNet is used to classify handwritten digits. Notice that the
test error rate is not the same as the training error rate, because
the test set consists of items not in the training set. Not all
classification schemes necessarily have small test error when they have small training error. LeNet achieves an error rate of 0.95% on the MNIST database.
Weight decay:
\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i
E' = E + \frac{\lambda}{2} \sum_i w_i^2
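A sketch of weight decay added to the earlier delta-rule fit, i.e. gradient descent on E' = E + (λ/2) Σ_i w_i^2; λ, the learning rate, and the data are illustrative placeholders.

```python
import numpy as np

def delta_rule_weight_decay(X, t, eta=0.01, lam=0.1, epochs=500):
    """Gradient descent on E' = E + (lambda/2) * sum_i w_i^2, i.e. the
    update delta_w_i = -eta * (dE/dw_i + lambda * w_i)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w
        w += eta * (X.T @ (t - o) - lam * w)   # error gradient plus decay term
    return w

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
t = 2 * x + 1 + 0.1 * rng.standard_normal(50)
print(delta_rule_weight_decay(X, t))   # approximately [1, 2], slightly shrunk toward 0
```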
Bayesian Learning
• Consider the weights w_i as random variables with prior p(w_i)
p(w \mid X) = \frac{p(X \mid w)\, p(w)}{p(X)} \qquad \hat{w}_{MAP} = \arg\max_w \log p(w \mid X)
\log p(w \mid X) = \log p(X \mid w) + \log p(w) + C
p(w) = \prod_i p(w_i) \quad \text{where} \quad p(w_i) = c \cdot \exp\left[-\frac{w_i^2}{2(1/2\lambda)}\right]
E' = E + \lambda \|w\|^2
With this Gaussian prior, maximizing the posterior is equivalent to weight decay.
Dimensionality Reduction
Learning Time
• Applications:
– Sequence recognition: Speech recognition
– Sequence reproduction: Time-series prediction
– Sequence association
• Network architectures
– Time-delay networks (Waibel et al., 1989)
– Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time
Architecture of the complete system: they use another neural
net to estimate orientation of the face, then rectify it. They
search over scales to find bigger/smaller faces.
Figure from “Rotation invariant neural-network based face detection,” H.A. Rowley, S.
Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright
1998, IEEE
Sources
• Slides by Ethem Alpaydin, Introduction to Machine Learning, © The MIT Press, 2004
• Slides by Tom M. Mitchell
• Ethem Alpaydin, Introduction to Machine Learning, Chapter 10
• Tom M. Mitchell, Machine Learning