
Learning: Perceptrons &

Neural Networks
Artificial Intelligence
CMSC 25000
February 8, 2007
Roadmap
• Perceptrons: Single layer networks
• Perceptron training
• Perceptron convergence theorem
• Perceptron limitations
• Neural Networks
– Motivation: Overcoming perceptron limitations
– Motivation: ALVINN
– Heuristic Training
• Backpropagation; Gradient descent
• Avoiding overfitting
• Avoiding local minima
– Conclusion: Teaching a Net to talk
Perceptron Structure
[Figure: single-layer perceptron. Inputs x0 = 1, x1, x2, x3, ..., xn feed the output unit y through weights w0, w1, w2, w3, ..., wn.]

y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}

x0 = 1 with weight w0 compensates for the threshold
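A minimal sketch of this output rule in Python; the weight values are illustrative, with the threshold folded into w0 on the constant input x0 = 1:

```python
def perceptron_output(weights, inputs):
    """Single-layer perceptron: fire (1) if the weighted sum exceeds 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

# x0 = 1 carries the threshold weight w0; the remaining entries are the real inputs.
w = [-0.5, 1.0, 1.0]                     # illustrative weights: w0, w1, w2
print(perceptron_output(w, [1, 0, 1]))   # -> 1: weighted sum 0.5 > 0
print(perceptron_output(w, [1, 0, 0]))   # -> 0: weighted sum -0.5
```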
Perceptron Convergence
Procedure
• Straight-forward training procedure
– Learns linearly separable functions
• Until the perceptron yields the correct output for all training examples (sketched in code below):
– If the perceptron is correct, do nothing
– If the perceptron is wrong,
• If it incorrectly says “yes”,
– Subtract the input vector from the weight vector
• Otherwise, add the input vector to the weight vector
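A minimal sketch of this loop in Python, assuming 0/1 targets and input vectors whose first component is the constant x0 = 1; the OR data and epoch cap are illustrative:

```python
def train_perceptron(samples, n_inputs, max_epochs=100):
    """Perceptron training: add/subtract misclassified inputs to/from the weights."""
    w = [0.0] * n_inputs
    for _ in range(max_epochs):
        all_correct = True
        for x, target in samples:            # x[0] == 1 carries the threshold
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if y == target:
                continue                     # correct: do nothing
            all_correct = False
            if y == 1:                       # incorrectly said "yes": subtract input
                w = [wi - xi for wi, xi in zip(w, x)]
            else:                            # incorrectly said "no": add input
                w = [wi + xi for wi, xi in zip(w, x)]
        if all_correct:
            break
    return w

# Example: learn logical OR (linearly separable, so training converges).
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(train_perceptron(data, 3))
```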
Perceptron Convergence Theorem
• If there exists a weight vector v (take ||v|| = 1) s.t. v·x >= δ for all positive examples x,
then perceptron training will find a separating weight vector
• Proof sketch:
– After k mistakes (updates), w = x_1 + x_2 + ... + x_k, so v·w >= kδ
– ||w||^2 increases by at most ||x||^2 in each iteration:
||w + x||^2 <= ||w||^2 + ||x||^2, so ||w||^2 <= k||x||^2
– Since v·w / ||w|| <= 1:  kδ / (||x|| sqrt(k)) <= 1
– Hence k <= (||x|| / δ)^2: converges in O(1/δ^2) steps
Perceptron Learning
• Perceptrons learn linear decision boundaries
• E.g.
[Figure: two scatter plots in the (x1, x2) plane. Left: +'s and 0's separable by a single line. Right: the XOR pattern ("But not xor"): no single line separates the classes.]

With inputs in {-1, +1}, the XOR constraints are contradictory:

  x1  x2   constraint on w1x1 + w2x2
  -1  -1   w1x1 + w2x2 < 0
   1  -1   w1x1 + w2x2 > 0  => implies w1 > 0
   1   1   w1x1 + w2x2 > 0 follows, but should be false
  -1   1   w1x1 + w2x2 > 0  => implies w2 > 0
Perceptron Example
• Digit recognition
– Assume display = 8 lightable bars
– Inputs: bar on/off, plus threshold
– 65 training steps to recognize “8”
Perceptron Summary
• Motivated by neuron activation
• Simple training procedure
• Guaranteed to converge
– IF linearly separable
Neural Nets
• Multi-layer perceptrons
– Inputs: real-valued
– Intermediate “hidden” nodes
– Output(s): one (or more) discrete-valued

[Figure: feed-forward network. Inputs X1, X2, X3, X4 feed two layers of hidden nodes, which feed outputs Y1 and Y2 (layers labeled Inputs, Hidden, Hidden, Outputs).]


Neural Nets
• Pro: More general than perceptrons
– Not restricted to linear discriminants
– Multiple outputs: one classification each
• Con: No simple, guaranteed training
procedure
– Use greedy, hill-climbing procedure to train
– “Gradient descent”, “Backpropagation”
Solving the XOR Problem
Network topology: 2 hidden nodes (o1, o2), 1 output (y)

[Figure: inputs x1, x2 feed hidden nodes o1 (weights w11, w21) and o2 (weights w12, w22); o1 and o2 feed the output y (weights w13, w23); bias weights w01, w02, w03 attach to -1 inputs.]

Desired behavior:

  x1  x2 | o1  o2 | y
   0   0 |  0   0 | 0
   1   0 |  0   1 | 1
   0   1 |  0   1 | 1
   1   1 |  1   1 | 0

Weights:
  w11 = w12 = 1
  w21 = w22 = 1
  w01 = 3/2;  w02 = 1/2;  w03 = 1/2
  w13 = -1;  w23 = 1
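These weights can be checked directly. A small sketch in Python using step-threshold units, with each bias entering on a -1 input as in the figure:

```python
def step(total):
    """Threshold unit: output 1 if the net input is positive, else 0."""
    return 1 if total > 0 else 0

def xor_net(x1, x2):
    o1 = step(1 * x1 + 1 * x2 - 3/2)    # w11 = w21 = 1, bias w01 = 3/2: fires only for (1, 1)
    o2 = step(1 * x1 + 1 * x2 - 1/2)    # w12 = w22 = 1, bias w02 = 1/2: fires if either input is 1
    y  = step(-1 * o1 + 1 * o2 - 1/2)   # w13 = -1, w23 = 1, bias w03 = 1/2
    return o1, o2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # reproduces the desired-behavior table
```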
Neural Net Applications
• Speech recognition

• Handwriting recognition

• NETtalk: Letter-to-sound rules

• ALVINN: Autonomous driving


ALVINN
• Driving as a neural network
• Inputs:
– Image pixel intensities
• E.g. road and lane edges
• 5 Hidden nodes
• Outputs:
– Steering actions
• E.g. turn left/right; how far
• Training:
– Observe human behavior: sample images, steering
Backpropagation
• Greedy, Hill-climbing procedure
– Weights are parameters to change
– Original hill-climb changes one parameter/step
• Slow
– If smooth function, change all parameters/step
• Gradient descent
– Backpropagation: Computes current output, works
backward to correct error
Producing a Smooth Function
• Key problem:
– Pure step threshold is discontinuous
• Not differentiable
• Solution:
– Sigmoid (squashed ‘s’ function): Logistic fn

z = \sum_{i=1}^{n} w_i x_i, \qquad s(z) = \frac{1}{1 + e^{-z}}
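A small sketch of the logistic function and its derivative in Python; the derivative form s(z)(1 - s(z)) is the one used in the backpropagation updates later:

```python
import math

def sigmoid(z):
    """Logistic function: a smooth, differentiable replacement for the step threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    """ds/dz = s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid(0.0))        # 0.5: halfway point at z = 0
print(sigmoid_deriv(0.0))  # 0.25: steepest slope at z = 0
```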
Neural Net Training
• Goal:
– Determine how to change weights to get correct
output
• Large change in weight to produce large reduction in
error
• Approach:
• Compute actual output: o
• Compare to desired output: d
• Determine effect of each weight w on error = d-o
• Adjust weights
Neural Net Example
[Figure: two-layer network. Inputs x1, x2 feed hidden units with net inputs z1, z2 and outputs y1, y2 (weights w11, w21, w12, w22; bias weights w01, w02 on -1 inputs). The hidden outputs feed the output unit with net input z3 and output y3 (weights w13, w23; bias weight w03 on a -1 input). From 6.034 notes, Lozano-Perez.]

x_i: ith sample input vector
w: weight vector
y_i^*: desired output for ith sample

E = \frac{1}{2} \sum_i \big( y_i^* - F(x_i, w) \big)^2

Sum-of-squares error over training samples

y_3 = F(x, w) = s\big( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} \big)

(the two inner sums are z_1 and z_2; the outer sum is z_3)

Full expression of the output in terms of the inputs and weights
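A minimal Python sketch of this forward computation, assuming biases enter on -1 inputs as in the figure; the weight values below are just the step-threshold XOR weights from the earlier slide, used to exercise the code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    """Compute y3 = F(x, w) for the 2-2-1 network; w is a dict of named weights."""
    z1 = w["w11"] * x1 + w["w21"] * x2 - w["w01"]   # hidden unit 1 (bias input -1)
    z2 = w["w12"] * x1 + w["w22"] * x2 - w["w02"]   # hidden unit 2
    y1, y2 = sigmoid(z1), sigmoid(z2)
    z3 = w["w13"] * y1 + w["w23"] * y2 - w["w03"]   # output unit
    return sigmoid(z3)

# Placeholder weights (the XOR weights from the earlier slide).
weights = dict(w11=1, w21=1, w01=1.5, w12=1, w22=1, w02=0.5, w13=-1, w23=1, w03=0.5)
print(forward(1, 0, weights))   # ~0.44: sigmoid units blur the sharp step-threshold behavior
```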
Gradient Descent

• Error: Sum-of-squares error of the inputs with the current weights
• Compute rate of change of error wrt each
weight
– Which weights have greatest effect on error?
– Effectively, partial derivatives of error wrt
weights
• In turn, depend on other weights => chain rule
Gradient Descent
• E = G(w)
– Error as a function of the weights
• Find the rate of change of the error: dG/dw
– Follow the steepest rate of change
– Change weights s.t. error is minimized

[Figure: error E = G(w) plotted against weight w; successive weights w0, w1 step downhill; local minima marked.]
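A toy 1-D illustration of the descent rule w <- w - r*dG/dw, with an assumed error curve G(w) = (w - 2)^2 (not from the slides) and an arbitrary rate r:

```python
def G(w):
    """Illustrative error curve with a minimum at w = 2."""
    return (w - 2.0) ** 2

def dG_dw(w):
    """Hand-coded derivative of G."""
    return 2.0 * (w - 2.0)

w = 0.0          # arbitrary starting weight
r = 0.1          # rate parameter
for step in range(25):
    w -= r * dG_dw(w)   # move against the gradient: steepest descent
print(w, G(w))   # w approaches 2, error approaches 0
```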
Gradient of Error

E = \frac{1}{2} \sum_i \big( y_i^* - F(x_i, w) \big)^2

y_3 = F(x, w) = s\big( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01}) + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02}) - w_{03} \big)

(inner sums: z_1, z_2; outer sum: z_3)

\frac{\partial E}{\partial w_j} = -(y_i^* - y_3)\, \frac{\partial y_3}{\partial w_j}

Note: derivative of the sigmoid:  \frac{d s(z_1)}{d z_1} = s(z_1)\,(1 - s(z_1))

\frac{\partial y_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3}\, \frac{\partial z_3}{\partial w_{13}} = \frac{\partial s(z_3)}{\partial z_3}\, s(z_1) = \frac{\partial s(z_3)}{\partial z_3}\, y_1

\frac{\partial y_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, \frac{\partial z_3}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13}\, \frac{\partial s(z_1)}{\partial w_{11}} = \frac{\partial s(z_3)}{\partial z_3}\, w_{13}\, \frac{\partial s(z_1)}{\partial z_1}\, x_1

[Figure: the same 2-2-1 network. From MIT 6.034 AI lecture notes, Lozano-Perez 2000.]
From Effect to Update
• Gradient computation:
– How each weight contributes to performance
• To train:
– Need to determine how to CHANGE weight
based on contribution to performance
– Need to determine how MUCH change to make
per iteration
• Rate parameter ‘r’
– Large enough to learn quickly
– Small enough to reach, but not overshoot, the target values
Backpropagation Procedure
[Figure: a chain of units i -> j -> k, with weights w_{i->j} and w_{j->k}, outputs o_i, o_j, o_k, and derivative factors o_j(1 - o_j) and o_k(1 - o_k).]

• Pick rate parameter ‘r’
• Until performance is good enough,
– Do forward computation to calculate the output
– Compute Beta in the output node with
  \beta_z = d_z - o_z
– Compute Beta in all other nodes with
  \beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k
– Compute the change for all weights with
  \Delta w_{i \to j} = r\, o_i\, o_j (1 - o_j)\, \beta_j
Backprop Example

[Figure: the same 2-2-1 network: inputs x1, x2; hidden units with net inputs z1, z2 and outputs y1, y2; output unit with net input z3 and output y3; weights w11, w21, w01, w12, w22, w02, w13, w23, w03; bias inputs -1.]

Forward prop: compute z_i and y_i given x_k, w_l

\beta_3 = (y_3^* - y_3)
\beta_2 = y_3 (1 - y_3)\, \beta_3\, w_{23}
\beta_1 = y_3 (1 - y_3)\, \beta_3\, w_{13}

w_{03} \leftarrow w_{03} + r\, y_3 (1 - y_3)\, \beta_3\, (-1)
w_{02} \leftarrow w_{02} + r\, y_2 (1 - y_2)\, \beta_2\, (-1)
w_{01} \leftarrow w_{01} + r\, y_1 (1 - y_1)\, \beta_1\, (-1)
w_{13} \leftarrow w_{13} + r\, y_1\, y_3 (1 - y_3)\, \beta_3
w_{23} \leftarrow w_{23} + r\, y_2\, y_3 (1 - y_3)\, \beta_3
w_{12} \leftarrow w_{12} + r\, x_1\, y_2 (1 - y_2)\, \beta_2
w_{22} \leftarrow w_{22} + r\, x_2\, y_2 (1 - y_2)\, \beta_2
w_{11} \leftarrow w_{11} + r\, x_1\, y_1 (1 - y_1)\, \beta_1
w_{21} \leftarrow w_{21} + r\, x_2\, y_1 (1 - y_1)\, \beta_1
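A runnable sketch of these updates on the 2-2-1 network, assuming sigmoid units and -1 bias inputs as in the figure; the starting weights, rate r, and the XOR training samples are illustrative choices, not from the slides:

```python
import math

def s(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w):
    """Forward prop on the 2-2-1 network: returns (y1, y2, y3)."""
    y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
    return y1, y2, y3

def backprop_step(x1, x2, target, w, r=0.5):
    """One forward pass and one weight update, following the formulas above."""
    y1, y2, y3 = forward(x1, x2, w)
    b3 = target - y3                         # beta at the output node
    b2 = y3 * (1 - y3) * b3 * w["w23"]       # betas at the hidden nodes
    b1 = y3 * (1 - y3) * b3 * w["w13"]
    # delta_w(i->j) = r * o_i * o_j * (1 - o_j) * beta_j; the bias input o_i is -1.
    w["w13"] += r * y1 * y3 * (1 - y3) * b3
    w["w23"] += r * y2 * y3 * (1 - y3) * b3
    w["w03"] += r * (-1) * y3 * (1 - y3) * b3
    w["w11"] += r * x1 * y1 * (1 - y1) * b1
    w["w21"] += r * x2 * y1 * (1 - y1) * b1
    w["w01"] += r * (-1) * y1 * (1 - y1) * b1
    w["w12"] += r * x1 * y2 * (1 - y2) * b2
    w["w22"] += r * x2 * y2 * (1 - y2) * b2
    w["w02"] += r * (-1) * y2 * (1 - y2) * b2

# Illustrative start: small, asymmetric weights.
w = dict(w11=0.1, w21=-0.2, w01=0.05, w12=0.2, w22=0.1, w02=-0.1,
         w13=0.1, w23=-0.1, w03=0.05)
samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]      # XOR
for _ in range(5000):                                        # online training
    for x1, x2, t in samples:
        backprop_step(x1, x2, t, w)
print([round(forward(x1, x2, w)[2], 2) for x1, x2, _ in samples])
# Usually close to [0, 1, 1, 0]; like any gradient descent, it can also stall in a local minimum.
```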
Backpropagation Observations
• Procedure is (relatively) efficient
– All computations are local
• Use inputs and outputs of current node

• What is “good enough”?


– Rarely reach target (0 or 1) outputs
• Typically, train until within 0.1 of target
Neural Net Summary

• Training:
– Backpropagation procedure
• Gradient descent strategy (usual problems)
• Prediction:
– Compute outputs based on input vector & weights
• Pros: Very general, Fast prediction
• Cons: Training can be VERY slow (1000’s of
epochs), Overfitting
Training Strategies
• Online training:
– Update weights after each sample
• Offline (batch) training:
– Compute error over all samples
• Then update weights

• Online training “noisy”


– Sensitive to individual instances
– However, may escape local minima
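The two schedules differ only in when the weight update is applied. A sketch, assuming a hypothetical gradient(w, sample) helper that returns the per-sample error gradient as a list:

```python
def online_epoch(w, samples, gradient, r=0.1):
    """Online training: update the weights after each sample (a noisier path)."""
    for sample in samples:
        g = gradient(w, sample)
        w = [wi - r * gi for wi, gi in zip(w, g)]
    return w

def batch_epoch(w, samples, gradient, r=0.1):
    """Offline (batch) training: sum the gradient over all samples, then update once."""
    total = [0.0] * len(w)
    for sample in samples:
        g = gradient(w, sample)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [wi - r * ti for wi, ti in zip(w, total)]
```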
Training Strategy
• To avoid overfitting:
– Split data into: training, validation, & test
• Also, avoid excess weights (fewer weights than training samples)
• Initialize with small random weights
– Small changes have noticeable effect
• Use offline training
– Until validation set minimum
• Evaluate on test set
– No more weight changes
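A sketch of this train/validate/test loop, where epoch_update and valid_error are hypothetical stand-ins for one batch weight update and the validation-set error:

```python
def train_with_early_stopping(w, train, valid, epoch_update, valid_error, max_epochs=1000):
    """Batch-train until the validation error stops improving; keep the best weights."""
    best_w, best_err = w, valid_error(w, valid)
    for _ in range(max_epochs):
        w = epoch_update(w, train)          # one offline (batch) weight update
        err = valid_error(w, valid)
        if err < best_err:
            best_w, best_err = w, err       # validation still improving: keep going
        else:
            break                           # validation minimum reached: stop
    return best_w                           # then evaluate best_w on the held-out test set
```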
Classification
• Neural networks are best suited to classification tasks
– Single output -> Binary classifier
– Multiple outputs -> Multiway classification
• Applied successfully to learning pronunciation

– Sigmoid pushes outputs toward binary decisions


• Not good for regression
Neural Net Example
• NETtalk: Letter-to-sound by net
• Inputs:
– Need context to pronounce
• 7-letter window: predict sound of middle letter
• 29 possible characters: alphabet + space + comma + period
– 7*29=203 inputs

• 80 Hidden nodes
• Output: Generate 60 phones
– Nodes map to 26 units: 21 articulatory features, 5 for stress/syllable structure
• Vector quantization of acoustic space
Neural Net Example: NETtalk
• Learning to talk:
– 5 iterations/1024 training words: bound/stress
– 10 iterations: intelligible
– 400 new test words: 80% correct

• Not as good as DecTalk, but automatic


Neural Net Conclusions
• Simulation based on neurons in brain
• Perceptrons (single neuron)
– Guaranteed to find linear discriminant
• IF one exists (hence the problem with XOR)
• Neural nets (Multi-layer perceptrons)
– Very general
– Backpropagation training procedure
• Gradient descent - local min, overfitting issues
