
Learning: Perceptrons &

Neural Networks
Artificial Intelligence
CMSC 25000
February 8, 2007
Roadmap
• Perceptrons: Single layer networks
• Perceptron training
• Perceptron convergence theorem
• Perceptron limitations
• Neural Networks
– Motivation: Overcoming perceptron limitations
– Motivation: ALVINN
– Heuristic Training
• Backpropagation; Gradient descent
• Avoiding overfitting
• Avoiding local minima
– Conclusion: Teaching a Net to talk
Perceptron Structure
[Figure: single-layer perceptron. Inputs x0 = 1, x1, x2, x3, ..., xn feed the output unit y through weights w0, w1, w2, w3, ..., wn.]

y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}

x0 = 1 with weight w0 compensates for the threshold
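A minimal sketch of this output rule in Python; the weight values are illustrative, with the threshold folded into w0 on the constant input x0 = 1:

```python
def perceptron_output(weights, inputs):
    """Single-layer perceptron: fire (1) if the weighted sum exceeds 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

# x0 = 1 carries the threshold weight w0; the remaining entries are the real inputs.
w = [-0.5, 1.0, 1.0]                     # illustrative weights: w0, w1, w2
print(perceptron_output(w, [1, 0, 1]))   # -> 1: weighted sum 0.5 > 0
print(perceptron_output(w, [1, 0, 0]))   # -> 0: weighted sum -0.5
```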
Perceptron Convergence
Procedure
• Straight-forward training procedure
– Learns linearly separable functions
• Until the perceptron yields the correct output for all training examples (sketched in code below):
– If the perceptron is correct, do nothing
– If the perceptron is wrong,
• If it incorrectly says “yes”,
– Subtract the input vector from the weight vector
• Otherwise, add the input vector to the weight vector
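A minimal sketch of this loop in Python, assuming 0/1 targets and input vectors whose first component is the constant x0 = 1; the OR data and epoch cap are illustrative:

```python
def train_perceptron(samples, n_inputs, max_epochs=100):
    """Perceptron training: add/subtract misclassified inputs to/from the weights."""
    w = [0.0] * n_inputs
    for _ in range(max_epochs):
        all_correct = True
        for x, target in samples:            # x[0] == 1 carries the threshold
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y == target:
                continue                     # correct: do nothing
            all_correct = False
            if y == 1:                       # incorrectly said "yes": subtract input
                w = [wi - xi for wi, xi in zip(w, x)]
            else:                            # incorrectly said "no": add input
                w = [wi + xi for wi, xi in zip(w, x)]
        if all_correct:
            break
    return w

# Example: learn logical OR (linearly separable, so training converges).
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(train_perceptron(data, 3))
```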
Perceptron Convergence Theorem
• If there exists a weight vector v (take ||v|| = 1) s.t. v·x >= δ for all positive examples x,
then perceptron training will find a separating weight vector
• Proof sketch:
– After k mistakes (updates), w = x_1 + x_2 + ... + x_k, so v·w >= kδ
– ||w||^2 increases by at most ||x||^2 in each iteration:
||w + x||^2 <= ||w||^2 + ||x||^2, so ||w||^2 <= k||x||^2
– Since v·w / ||w|| <= 1:  kδ / (||x|| sqrt(k)) <= 1
– Hence k <= (||x|| / δ)^2: converges in O(1/δ^2) steps
Perceptron Learning
• Perceptrons learn linear decision boundaries
• E.g.
[Figure: two scatter plots in the (x1, x2) plane. Left: +'s and 0's separable by a single line. Right: the XOR pattern ("But not xor"): no single line separates the classes.]

With inputs in {-1, +1}, the XOR constraints are contradictory:

  x1  x2   constraint on w1x1 + w2x2
  -1  -1   w1x1 + w2x2 < 0
   1  -1   w1x1 + w2x2 > 0  => implies w1 > 0
   1   1   w1x1 + w2x2 > 0 follows, but should be false
  -1   1   w1x1 + w2x2 > 0  => implies w2 > 0
Perceptron Example
• Digit recognition
– Assume display = 8 lightable bars
– Inputs: bar on/off, plus threshold
– 65 training steps to recognize “8”
Perceptron Summary
• Motivated by neuron activation
• Simple training procedure
• Guaranteed to converge
– IF linearly separable
Neural Nets
• Multi-layer perceptrons
– Inputs: real-valued
– Intermediate “hidden” nodes
– Output(s): one (or more) discrete-valued

[Figure: feed-forward network. Inputs X1, X2, X3, X4 feed two layers of hidden nodes, which feed outputs Y1 and Y2 (layers labeled Inputs, Hidden, Hidden, Outputs).]


Neural Nets
• Pro: More general than perceptrons
– Not restricted to linear discriminants
– Multiple outputs: one classification each
• Con: No simple, guaranteed training
procedure
– Use greedy, hill-climbing procedure to train
– “Gradient descent”, “Backpropagation”
Solving the XOR Problem
Network topology: 2 hidden nodes (o1, o2), 1 output (y)

[Figure: inputs x1, x2 feed hidden nodes o1 (weights w11, w21) and o2 (weights w12, w22); o1 and o2 feed the output y (weights w13, w23); bias weights w01, w02, w03 attach to -1 inputs.]

Desired behavior:

  x1  x2 | o1  o2 | y
   0   0 |  0   0 | 0
   1   0 |  0   1 | 1
   0   1 |  0   1 | 1
   1   1 |  1   1 | 0

Weights:
  w11 = w12 = 1
  w21 = w22 = 1
  w01 = 3/2;  w02 = 1/2;  w03 = 1/2
  w13 = -1;  w23 = 1
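These weights can be checked directly. A small sketch in Python using step-threshold units, with each bias entering on a -1 input as in the figure:

```python
def step(total):
    """Threshold unit: output 1 if the net input is positive, else 0."""
    return 1 if total > 0 else 0

def xor_net(x1, x2):
    o1 = step(1 * x1 + 1 * x2 - 3/2)    # w11 = w21 = 1, bias w01 = 3/2: fires only for (1, 1)
    o2 = step(1 * x1 + 1 * x2 - 1/2)    # w12 = w22 = 1, bias w02 = 1/2: fires if either input is 1
    y  = step(-1 * o1 + 1 * o2 - 1/2)   # w13 = -1, w23 = 1, bias w03 = 1/2
    return o1, o2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # reproduces the desired-behavior table
```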
Neural Net Applications
• Speech recognition

• Handwriting recognition

• NETtalk: Letter-to-sound rules

• ALVINN: Autonomous driving


ALVINN
• Driving as a neural network
• Inputs:
– Image pixel intensities
• E.g. road and lane edges
• 5 Hidden nodes
• Outputs:
– Steering actions
• E.g. turn left/right; how far
• Training:
– Observe human behavior: sample images, steering
Backpropagation
• Greedy, Hill-climbing procedure
– Weights are parameters to change
– Original hill-climb changes one parameter/step
• Slow
– If smooth function, change all parameters/step
• Gradient descent
– Backpropagation: Computes current output, works
backward to correct error
Producing a Smooth Function
• Key problem:
– Pure step threshold is discontinuous
• Not differentiable
• Solution:
– Sigmoid (squashed ‘s’ function): Logistic fn

z = \sum_{i=1}^{n} w_i x_i, \qquad s(z) = \frac{1}{1 + e^{-z}}
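A small sketch of the logistic function and its derivative in Python; the derivative form s(z)(1 - s(z)) is the one used in the backpropagation updates later:

```python
import math

def sigmoid(z):
    """Logistic function: a smooth, differentiable replacement for the step threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    """ds/dz = s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid(0.0))        # 0.5: halfway point at z = 0
print(sigmoid_deriv(0.0))  # 0.25: steepest slope at z = 0
```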
Neural Net Training
• Goal:
– Determine how to change weights to get correct
output
• Large change in weight to produce large reduction in
error
• Approach:
• Compute actual output: o
• Compare to desired output: d
• Determine effect of each weight w on error = d-o
• Adjust weights
Neural Net Example
[Figure: two-layer network. Inputs x1, x2 feed hidden units with net inputs z1, z2 and outputs y1, y2 (weights w11, w21, w12, w22; bias weights w01, w02 on -1 inputs). The hidden outputs feed the output unit with net input z3 and output y3 (weights w13, w23; bias weight w03 on a -1 input). From 6.034 notes, Lozano-Perez.]

x_i: ith sample input vector
w: weight vector
y_i^*: desired output for ith sample

E = \frac{1}{2} \sum_i \big( y_i^* - F(x_i, w) \big)^2

Sum-of-squares error over training samples

y_3 = F(x, w) = s\big( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} \big)

(the two inner sums are z_1 and z_2; the outer sum is z_3)

Full expression of the output in terms of the inputs and weights
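A minimal Python sketch of this forward computation, assuming biases enter on -1 inputs as in the figure; the weight values below are just the step-threshold XOR weights from the earlier slide, used to exercise the code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    """Compute y3 = F(x, w) for the 2-2-1 network; w is a dict of named weights."""
    z1 = w["w11"] * x1 + w["w21"] * x2 - w["w01"]   # hidden unit 1 (bias input -1)
    z2 = w["w12"] * x1 + w["w22"] * x2 - w["w02"]   # hidden unit 2
    y1, y2 = sigmoid(z1), sigmoid(z2)
    z3 = w["w13"] * y1 + w["w23"] * y2 - w["w03"]   # output unit
    return sigmoid(z3)

# Placeholder weights (the XOR weights from the earlier slide).
weights = dict(w11=1, w21=1, w01=1.5, w12=1, w22=1, w02=0.5, w13=-1, w23=1, w03=0.5)
print(forward(1, 0, weights))   # ~0.44: sigmoid units blur the sharp step-threshold behavior
```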
Gradient Descent

• Error: Sum-of-squares error of the inputs with the current weights
• Compute rate of change of error wrt each
weight
– Which weights have greatest effect on error?
– Effectively, partial derivatives of error wrt
weights
• In turn, depend on other weights => chain rule
Gradient Descent
• E = G(w)
– Error as a function of the weights
• Find the rate of change of the error: dG/dw
– Follow the steepest rate of change
– Change weights s.t. error is minimized

[Figure: error E = G(w) plotted against weight w; successive weights w0, w1 step downhill; local minima marked.]
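A toy 1-D illustration of the descent rule w <- w - r*dG/dw, with an assumed error curve G(w) = (w - 2)^2 (not from the slides) and an arbitrary rate r:

```python
def G(w):
    """Illustrative error curve with a minimum at w = 2."""
    return (w - 2.0) ** 2

def dG_dw(w):
    """Hand-coded derivative of G."""
    return 2.0 * (w - 2.0)

w = 0.0          # arbitrary starting weight
r = 0.1          # rate parameter
for step in range(25):
    w -= r * dG_dw(w)   # move against the gradient: steepest descent
print(w, G(w))   # w approaches 2, error approaches 0
```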
Gradient of Error

E = \frac{1}{2} \sum_i \big( y_i^* - F(x_i, w) \big)^2

y_3 = F(x, w) = s\big( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} \big)

(inner sums: z_1, z_2; outer sum: z_3)

\frac{\partial E}{\partial w_j} = -(y_i^* - y_3)\, \frac{\partial y_3}{\partial w_j}

Note: derivative of the sigmoid:  \frac{d s(z_1)}{d z_1} = s(z_1)\,(1 - s(z_1))

\frac{\partial y_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3}\, \frac{\partial z_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3}\, s(z_1) = \frac{\partial s(z_3)}{\partial z_3}\, y_1

\frac{\partial y_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, \frac{\partial z_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13}\, \frac{\partial s(z_1)}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13}\, \frac{\partial s(z_1)}{\partial z_1}\, x_1

[Figure: the same 2-2-1 network. From MIT 6.034 AI lecture notes, Lozano-Perez 2000.]
From Effect to Update
• Gradient computation:
– How each weight contributes to performance
• To train:
– Need to determine how to CHANGE weight
based on contribution to performance
– Need to determine how MUCH change to make
per iteration
• Rate parameter ‘r’
– Large enough to learn quickly
– Small enough to reach, but not overshoot, the target values
Backpropagation Procedure
[Figure: a chain of units i -> j -> k, with weights w_{i->j} and w_{j->k}, outputs o_i, o_j, o_k, and derivative factors o_j(1 - o_j) and o_k(1 - o_k).]

• Pick rate parameter ‘r’
• Until performance is good enough,
– Do forward computation to calculate the output
– Compute Beta in the output node with
  \beta_z = d_z - o_z
– Compute Beta in all other nodes with
  \beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k
– Compute the change for all weights with
  \Delta w_{i \to j} = r\, o_i\, o_j (1 - o_j)\, \beta_j
Backprop Example

[Figure: the same 2-2-1 network: inputs x1, x2; hidden units with net inputs z1, z2 and outputs y1, y2; output unit with net input z3 and output y3; weights w11, w21, w01, w12, w22, w02, w13, w23, w03; bias inputs -1.]

Forward prop: compute z_i and y_i given x_k, w_l

\beta_3 = (y_3^* - y_3)
\beta_2 = y_3 (1 - y_3)\, \beta_3\, w_{23}
\beta_1 = y_3 (1 - y_3)\, \beta_3\, w_{13}

w_{03} \leftarrow w_{03} + r\, y_3 (1 - y_3)\, \beta_3\, (-1)
w_{02} \leftarrow w_{02} + r\, y_2 (1 - y_2)\, \beta_2\, (-1)
w_{01} \leftarrow w_{01} + r\, y_1 (1 - y_1)\, \beta_1\, (-1)
w_{13} \leftarrow w_{13} + r\, y_1\, y_3 (1 - y_3)\, \beta_3
w_{23} \leftarrow w_{23} + r\, y_2\, y_3 (1 - y_3)\, \beta_3
w_{12} \leftarrow w_{12} + r\, x_1\, y_2 (1 - y_2)\, \beta_2
w_{22} \leftarrow w_{22} + r\, x_2\, y_2 (1 - y_2)\, \beta_2
w_{11} \leftarrow w_{11} + r\, x_1\, y_1 (1 - y_1)\, \beta_1
w_{21} \leftarrow w_{21} + r\, x_2\, y_1 (1 - y_1)\, \beta_1
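A runnable sketch of these updates on the 2-2-1 network, assuming sigmoid units and -1 bias inputs as in the figure; the starting weights, rate r, and the XOR training samples are illustrative choices, not from the slides:

```python
import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    """Forward prop on the 2-2-1 network: returns (y1, y2, y3)."""
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
    return y1, y2, y3

def backprop_step(x1, x2, target, w, r=0.5):
    """One forward pass and one weight update, following the formulas above."""
    y1, y2, y3 = forward(x1, x2, w)
    b3 = target - y3                         # beta at the output node
    b2 = y3 * (1 - y3) * b3 * w["w23"]       # betas at the hidden nodes
    b1 = y3 * (1 - y3) * b3 * w["w13"]
    # delta_w(i->j) = r * o_i * o_j * (1 - o_j) * beta_j; the bias input o_i is -1.
    w["w13"] += r * y1 * y3 * (1 - y3) * b3
    w["w23"] += r * y2 * y3 * (1 - y3) * b3
    w["w03"] += r * (-1) * y3 * (1 - y3) * b3
    w["w11"] += r * x1 * y1 * (1 - y1) * b1
    w["w21"] += r * x2 * y1 * (1 - y1) * b1
    w["w01"] += r * (-1) * y1 * (1 - y1) * b1
    w["w12"] += r * x1 * y2 * (1 - y2) * b2
    w["w22"] += r * x2 * y2 * (1 - y2) * b2
    w["w02"] += r * (-1) * y2 * (1 - y2) * b2

# Illustrative start: small, asymmetric weights.
w = dict(w11=0.1, w21=-0.2, w01=0.05, w12=0.2, w22=0.1, w02=-0.1,
         w13=0.1, w23=-0.1, w03=0.05)
samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]      # XOR
for _ in range(5000):                                        # online training
    for x1, x2, t in samples:
        backprop_step(x1, x2, t, w)
print([round(forward(x1, x2, w)[2], 2) for x1, x2, _ in samples])
# Usually close to [0, 1, 1, 0]; like any gradient descent, it can also stall in a local minimum.
```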
Backpropagation Observations
• Procedure is (relatively) efficient
– All computations are local
• Use inputs and outputs of current node

• What is “good enough”?


– Rarely reach target (0 or 1) outputs
• Typically, train until within 0.1 of target
Neural Net Summary

• Training:
– Backpropagation procedure
• Gradient descent strategy (usual problems)
• Prediction:
– Compute outputs based on input vector & weights
• Pros: Very general, Fast prediction
• Cons: Training can be VERY slow (1000’s of
epochs), Overfitting
Training Strategies
• Online training:
– Update weights after each sample
• Offline (batch) training:
– Compute error over all samples
• Then update weights

• Online training “noisy”


– Sensitive to individual instances
– However, may escape local minima
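The two schedules differ only in when the weight update is applied. A sketch, assuming a hypothetical gradient(w, sample) helper that returns the per-sample error gradient as a list:

```python
def online_epoch(w, samples, gradient, r=0.1):
    """Online training: update the weights after each sample (a noisier path)."""
    for sample in samples:
        g = gradient(w, sample)
        w = [wi - r * gi for wi, gi in zip(w, g)]
    return w

def batch_epoch(w, samples, gradient, r=0.1):
    """Offline (batch) training: sum the gradient over all samples, then update once."""
    total = [0.0] * len(w)
    for sample in samples:
        g = gradient(w, sample)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [wi - r * ti for wi, ti in zip(w, total)]
```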
Training Strategy
• To avoid overfitting:
– Split data into: training, validation, & test
• Also, avoid excess weights (fewer weights than training samples)
• Initialize with small random weights
– Small changes have noticeable effect
• Use offline training
– Until validation set minimum
• Evaluate on test set
– No more weight changes
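A sketch of this train/validate/test loop, where epoch_update and valid_error are hypothetical stand-ins for one batch weight update and the validation-set error:

```python
def train_with_early_stopping(w, train, valid, epoch_update, valid_error, max_epochs=1000):
    """Batch-train until the validation error stops improving; keep the best weights."""
    best_w, best_err = w, valid_error(w, valid)
    for _ in range(max_epochs):
        w = epoch_update(w, train)          # one offline (batch) weight update
        err = valid_error(w, valid)
        if err < best_err:
            best_w, best_err = w, err       # validation still improving: keep going
        else:
            break                           # validation minimum reached: stop
    return best_w                           # then evaluate best_w on the held-out test set
```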
Classification
• Neural networks are best suited to classification tasks
– Single output -> Binary classifier
– Multiple outputs -> Multiway classification
• Applied successfully to learning pronunciation

– Sigmoid pushes outputs toward binary decisions


• Not good for regression
Neural Net Example
• NETtalk: Letter-to-sound by net
• Inputs:
– Need context to pronounce
• 7-letter window: predict sound of middle letter
• 29 possible characters: alphabet + space + comma + period
– 7*29=203 inputs

• 80 Hidden nodes
• Output: Generate 60 phones
– Nodes map to 26 units: 21 articulatory features, 5 for stress/syllable structure
• Vector quantization of acoustic space
Neural Net Example: NETtalk
• Learning to talk:
– 5 iterations/1024 training words: bound/stress
– 10 iterations: intelligible
– 400 new test words: 80% correct

• Not as good as DecTalk, but automatic


Neural Net Conclusions
• Simulation based on neurons in brain
• Perceptrons (single neuron)
– Guaranteed to find linear discriminant
• IF one exists (hence the problem with XOR)
• Neural nets (Multi-layer perceptrons)
– Very general
– Backpropagation training procedure
• Gradient descent - local min, overfitting issues
