Neural Networks
Artificial Intelligence
CMSC 25000
February 8, 2007
Roadmap
• Perceptrons: Single layer networks
• Perceptron training
• Perceptron convergence theorem
• Perceptron limitations
• Neural Networks
– Motivation: Overcoming perceptron limitations
– Motivation: ALVINN
– Heuristic Training
• Backpropagation; Gradient descent
• Avoiding overfitting
• Avoiding local minima
– Conclusion: Teaching a Net to talk
Perceptron Structure
[Figure: single-layer perceptron with inputs x0 = 1, x1, x2, x3, ..., xn, weights w0, w1, ..., wn, and output y]
• y = 1 if Σ_{i=0..n} w_i x_i > 0, and y = 0 otherwise
• The constant input x0 = 1, with weight w0, compensates for the threshold
Perceptron Convergence Procedure
• Straightforward training procedure (a sketch follows below)
– Learns linearly separable functions
• Repeat until the perceptron yields the correct output for all training examples:
– If the perceptron is correct, do nothing
– If the perceptron is wrong:
• If it incorrectly says “yes”, subtract the input vector from the weight vector
• Otherwise, add the input vector to the weight vector
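A minimal sketch of this procedure in Python, assuming inputs are already augmented with the constant x0 = 1 component and labels are 0/1; the function name train_perceptron and the max_epochs cap are illustrative, not from the slides.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Perceptron training: add/subtract misclassified inputs.

    X: (m, n) array of inputs, each already augmented with x0 = 1.
    y: (m,) array of desired 0/1 outputs.
    Returns a separating weight vector, or None if none was found
    within max_epochs passes over the data.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, target in zip(X, y):
            out = 1 if w @ xi > 0 else 0
            if out == target:
                continue          # correct: do nothing
            if out == 1:          # incorrectly said "yes"
                w -= xi           # subtract input from weight vector
            else:                 # incorrectly said "no"
                w += xi           # add input to weight vector
            mistakes += 1
        if mistakes == 0:
            return w              # correct on every training example
    return None

# Example: learning AND of two inputs (first column is x0 = 1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))
```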
Perceptron Convergence Theorem
• If there exists a weight vector v that classifies every training example correctly, perceptron training will find such a vector
• Proof sketch: assume v is a unit vector and v·x ≥ δ for all positive examples x (negative examples are negated, so every mistake adds an example x to w)
• After k mistakes, w = x_1 + x_2 + ... + x_k, so v·w ≥ kδ
• ||w||² increases by at most ||x||² in each iteration: ||w + x||² ≤ ||w||² + ||x||² (since w·x ≤ 0 on a mistake), so ||w||² ≤ k·max||x||²
• Since v·w / ||w|| ≤ 1, we get kδ / (k^{1/2}·max||x||) ≤ 1
• Converges in k ≤ (max||x|| / δ)² steps
Perceptron Learning
• Perceptrons learn linear decision boundaries
• E.g.
[Figure: two scatter plots in the (x1, x2) plane; on the left, the + and 0 classes can be separated by a line, but on the right (an XOR-like arrangement) they cannot]
• XOR is not linearly separable: no weights w1, w2 can satisfy all four constraints (a quick empirical check follows below)

  x1   x2   XOR   Constraint
  -1   -1    0    w1·x1 + w2·x2 < 0
   1   -1    1    w1·x1 + w2·x2 > 0  => with row 1, implies w1 > 0
  -1    1    1    w1·x1 + w2·x2 > 0  => with row 1, implies w2 > 0
   1    1    0    w1·x1 + w2·x2 > 0 follows from w1, w2 > 0, but XOR(1,1) = 0 requires < 0: contradiction
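Running the perceptron update rule on these four points shows the same thing empirically: it never reaches a mistake-free pass. A self-contained sketch (the 1000-epoch cap is arbitrary):

```python
import numpy as np

# Four XOR examples as (x0 = 1, x1, x2) with 0/1 targets
X = np.array([[1, -1, -1], [1, 1, -1], [1, -1, 1], [1, 1, 1]])
y = np.array([0, 1, 1, 0])

w = np.zeros(3)
for epoch in range(1000):
    mistakes = 0
    for xi, t in zip(X, y):
        out = 1 if w @ xi > 0 else 0
        if out != t:
            w += xi if t == 1 else -xi
            mistakes += 1
    if mistakes == 0:
        break

print(mistakes)   # never reaches 0: XOR is not linearly separable
```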
Perceptron Example
• Digit recognition
– Assume a display with 8 lightable bars
– Inputs: bar on/off, plus the threshold input
– 65 steps to recognize “8”
Perceptron Summary
• Motivated by neuron activation
• Simple training procedure
• Guaranteed to converge
– IF linearly separable
Neural Nets
• Multi-layer perceptrons
– Inputs: real-valued
– Intermediate “hidden” nodes
– Output(s): one (or more) discrete-valued
[Figure: feed-forward network mapping inputs X1, X2, X3, X4 through hidden nodes to outputs Y1 and Y2]
Network topology: 2 hidden nodes (o1, o2), 1 output (y)

[Figure: inputs x1, x2 feed hidden nodes o1 and o2 through weights w11, w21, w12, w22, with constant -1 bias inputs weighted w01, w02; o1 and o2 feed the output y through weights w13, w23, with a -1 bias input weighted w03]

Weights:
  w11 = w12 = 1
  w21 = w22 = 1
  w01 = 3/2; w02 = 1/2; w03 = 1/2
  w13 = -1; w23 = 1

Desired behavior (a short verification in code follows below):
  x1  x2 | o1  o2 | y
   0   0 |  0   0 | 0
   1   0 |  0   1 | 1
   0   1 |  0   1 | 1
   1   1 |  1   1 | 0
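A short check, not from the slides, that these weights implement XOR with step-threshold units and the -1 bias-input convention above:

```python
def step(z):
    return 1 if z > 0 else 0

# Weights from the slide: wij goes from input xi to hidden node oj,
# w0j weights the constant -1 bias input, w13/w23 feed the output y.
w11 = w12 = 1
w21 = w22 = 1
w01, w02, w03 = 3/2, 1/2, 1/2
w13, w23 = -1, 1

for x1 in (0, 1):
    for x2 in (0, 1):
        o1 = step(w11 * x1 + w21 * x2 - w01)   # behaves like AND
        o2 = step(w12 * x1 + w22 * x2 - w02)   # behaves like OR
        y = step(w13 * o1 + w23 * o2 - w03)    # "o2 and not o1" = XOR
        print(x1, x2, o1, o2, y)
# Prints exactly the desired-behavior table: y = XOR(x1, x2)
```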
Neural Net Applications
• Speech recognition
• Handwriting recognition
• Node activation: z = Σ_{i=1..n} w_i x_i,   s(z) = 1 / (1 + e^(-z))
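The same unit rendered directly in Python (the name sigmoid_unit is illustrative):

```python
import numpy as np

def sigmoid_unit(w, x):
    """Compute s(z) = 1 / (1 + e^(-z)) with z = sum_i w_i * x_i."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))
```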
Neural Net Training
• Goal:
– Determine how to change the weights to get the correct output
• Make large changes to the weights that produce large reductions in error
• Approach:
• Compute actual output: o
• Compare to desired output: d
• Determine effect of each weight w on error = d-o
• Adjust weights
Neural Net Example
[Figure: two-input sigmoid network; x1, x2 feed hidden nodes z1, z2 (outputs y1, y2) through weights w11, w21, w12, w22 with -1 bias inputs weighted w01, w02; y1, y2 feed the output node z3 (output y3) through weights w13, w23 with a -1 bias input weighted w03]

• x^i : i-th sample input vector
• w : weight vector
• y^i* : desired output for the i-th sample

• E = 1/2 Σ_i ( y^i* - F(x^i, w) )²
– Sum-of-squares error over the training samples

• Full expression of the output in terms of the inputs and weights:
  y3 = F(x, w) = s( w13·s(w11·x1 + w21·x2 - w01) + w23·s(w12·x1 + w22·x2 - w02) - w03 )

(From the 6.034 notes, Lozano-Pérez)
Gradient Descent
[Figure: the same network annotated with error terms δ1, δ2 at the hidden nodes and δ3 at the output]

• Forward prop: compute the z_i and y_i given the inputs x_k and weights w_l
• Backward prop (error terms):
  δ3 = (y3* - y3)
  δ2 = y3 (1 - y3) δ3 w23
  δ1 = y3 (1 - y3) δ3 w13
(A sketch of the full update loop follows below.)
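A compact sketch of gradient-descent training for this 2-2-1 sigmoid network. The weight names follow the slides; the learning rate eta, the random initialization, and the explicit weight-update expressions are standard backpropagation filled in here as an assumption, not taken verbatim from the slides.

```python
import math
import random

def s(z):
    """Sigmoid activation s(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def train_net(samples, eta=0.5, epochs=10000):
    """Online gradient descent on the 2-2-1 network from the slides.

    samples: list of ((x1, x2), y_star) pairs.
    Returns the learned weights keyed by the slide names.
    """
    w = {k: random.uniform(-0.5, 0.5)
         for k in ("w11", "w21", "w12", "w22", "w13", "w23", "w01", "w02", "w03")}
    for _ in range(epochs):
        for (x1, x2), y_star in samples:
            # Forward prop: compute the z_i and y_i
            y1 = s(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
            y2 = s(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
            y3 = s(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
            # Backward prop: error terms as on the slide
            d3 = y_star - y3
            d1 = y3 * (1 - y3) * d3 * w["w13"]
            d2 = y3 * (1 - y3) * d3 * w["w23"]
            # Gradient-descent updates of the sum-of-squares error
            w["w13"] += eta * y3 * (1 - y3) * d3 * y1
            w["w23"] += eta * y3 * (1 - y3) * d3 * y2
            w["w03"] += eta * y3 * (1 - y3) * d3 * (-1)
            w["w11"] += eta * d1 * y1 * (1 - y1) * x1
            w["w21"] += eta * d1 * y1 * (1 - y1) * x2
            w["w01"] += eta * d1 * y1 * (1 - y1) * (-1)
            w["w12"] += eta * d2 * y2 * (1 - y2) * x1
            w["w22"] += eta * d2 * y2 * (1 - y2) * x2
            w["w02"] += eta * d2 * y2 * (1 - y2) * (-1)
    return w

# Example: train on the XOR table from the earlier slide
xor = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]
weights = train_net(xor)
```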
• Training:
– Backpropagation procedure
• Gradient descent strategy (usual problems)
• Prediction:
– Compute outputs based on input vector & weights
• Pros: Very general, Fast prediction
• Cons: Training can be VERY slow (thousands of epochs); prone to overfitting
Training Strategies
• Online training:
– Update weights after each sample
• Offline (batch) training:
– Compute the error over all samples, then update the weights
(A schematic contrast of the two schedules follows below.)
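A sketch only: grad(w, x, t) stands for whatever per-example error gradient the network computes (it is not defined in the slides), and w is assumed to be a NumPy array of weights.

```python
import numpy as np

def online_epoch(w, data, eta, grad):
    # Online: update the weights after each individual sample
    for x, t in data:
        w = w - eta * grad(w, x, t)
    return w

def batch_epoch(w, data, eta, grad):
    # Batch: accumulate the gradient over all samples, then update once
    total = np.zeros_like(w)
    for x, t in data:
        total += grad(w, x, t)
    return w - eta * total
```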
Neural Net Example: NETtalk
• 80 hidden nodes
• Output: generate 60 phones
– Output nodes map to 26 units: 21 articulatory, 5 stress/sil
• Vector quantization of acoustic space
• Learning to talk:
– 5 iterations/1024 training words: bound/stress
– 10 iterations: intelligible
– 400 new test words: 80% correct