
Introduction to Neural Networks
John Paxton
Montana State University
Summer 2003
Chapter 6: Backpropagation
1986 Rumelhart, Hinton, Williams
Gradient descent method that minimizes
the total squared error of the output.
Applicable to multilayer, feedforward,
supervised neural networks.
Revitalizes interest in neural networks!
Backpropagation
Appropriate for any domain where inputs
must be mapped onto outputs.
1 hidden layer is sufficient to learn any
continuous mapping to any arbitrary
accuracy!
Memorization versus generalization
tradeoff.
Architecture
input layer, hidden layer, output layer

[Diagram: bias units and inputs x1 .. xn feed hidden units z1 .. zp through weights v (e.g. vnp); a bias unit and the hidden units feed outputs y1 .. ym through weights w (e.g. wpm).]
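A minimal sketch of this architecture in NumPy, just to fix notation (the sizes n, p, m and the bias-in-row-0 layout are illustrative assumptions, not part of the slides):

```python
import numpy as np

# n input units x, p hidden units z, m output units y; one bias unit feeds
# each of the hidden and output layers (row 0 of each weight matrix below).
n, p, m = 2, 4, 1                              # illustrative sizes, e.g. for XOR

rng = np.random.default_rng(0)
v = rng.uniform(-0.5, 0.5, size=(n + 1, p))    # input -> hidden weights (v)
w = rng.uniform(-0.5, 0.5, size=(p + 1, m))    # hidden -> output weights (w)
```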
General Process
Feedforward the input signals.

Backpropagate the error.

Adjust the weights.


Activation Function Characteristics
Continuous.
Differentiable.
Monotonically nondecreasing.
Easy to compute.
Saturates (reaches limits).
Activation Functions
Binary Sigmoid
f(x) = 1 / [1 + e^-x]
f'(x) = f(x) * [1 - f(x)]

Bipolar Sigmoid
f(x) = -1 + 2 / [1 + e^-x]
f'(x) = 0.5 * [1 + f(x)] * [1 - f(x)]
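A quick sketch of both activation functions and their derivatives in NumPy (function names are illustrative):

```python
import numpy as np

def binary_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_sigmoid_prime(x):
    fx = binary_sigmoid(x)
    return fx * (1.0 - fx)                 # f'(x) = f(x) * [1 - f(x)]

def bipolar_sigmoid(x):
    return -1.0 + 2.0 / (1.0 + np.exp(-x))

def bipolar_sigmoid_prime(x):
    fx = bipolar_sigmoid(x)
    return 0.5 * (1.0 + fx) * (1.0 - fx)   # f'(x) = 0.5 * [1 + f(x)] * [1 - f(x)]
```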
Training Algorithm
1. initialize weights to small random
values, for example [-0.5 .. 0.5]

2. while stopping condition is false do steps 3-8

3. for each training pair do steps 4-8


Training Algorithm
4. zin.j = Σ (xi * vij)
zj = f(zin.j)
5. yin.j = Σ (zi * wij)
yj = f(yin.j)
6. error(yj) = (tj - yj) * f'(yin.j)
tj is the target value
7. error(zk) = [ Σ error(yj) * wkj ] * f'(zin.k)
Training Algorithm
8. wkj(new) = wkj(old) + a*error(yj)*zk
vkj(new) = vkj(old) + a*error(zj)*xk

a is the learning rate

An epoch is one cycle through the training vectors.
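A vectorized reading of steps 4-8 for a single training pair, using the bipolar sigmoid and the bias-in-row-0 weight layout from the earlier sketch (the function name train_pair and the matrix form are assumptions; the slides state the element-wise version):

```python
import numpy as np

def f(x):                                    # bipolar sigmoid
    return -1.0 + 2.0 / (1.0 + np.exp(-x))

def f_prime(x):
    fx = f(x)
    return 0.5 * (1.0 + fx) * (1.0 - fx)

def train_pair(x, t, v, w, a=1.0):
    """One pass of steps 4-8 for a single training pair.
    v: (n+1, p) input->hidden weights, w: (p+1, m) hidden->output weights,
    with bias weights in row 0; a is the learning rate."""
    x1 = np.concatenate(([1.0], x))            # prepend the bias input
    z_in = x1 @ v                              # step 4
    z = f(z_in)
    z1 = np.concatenate(([1.0], z))
    y_in = z1 @ w                              # step 5
    y = f(y_in)
    err_y = (t - y) * f_prime(y_in)            # step 6
    err_z = (err_y @ w[1:].T) * f_prime(z_in)  # step 7 (bias row excluded)
    w += a * np.outer(z1, err_y)               # step 8
    v += a * np.outer(x1, err_z)
    return y
```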
Choices
Initial Weights
random [-0.5 .. 0.5], don't want the derivative to be 0
Nguyen-Widrow
b = 0.7 * p^(1/n)
n = number of input units
p = number of hidden units
vij = b * vij(random) / || vj(random) ||
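A possible NumPy reading of the Nguyen-Widrow recipe above (the treatment of the bias row is a common convention, not something stated on the slide):

```python
import numpy as np

def nguyen_widrow(n, p, seed=0):
    """Sketch of the recipe above: beta = 0.7 * p^(1/n), then rescale each
    hidden unit's random weight vector to norm beta. Drawing the bias row
    from [-beta, beta] is a common convention, assumed here."""
    rng = np.random.default_rng(seed)
    beta = 0.7 * p ** (1.0 / n)
    v = rng.uniform(-0.5, 0.5, size=(n + 1, p))
    norms = np.linalg.norm(v[1:], axis=0)       # ||vj(random)|| per hidden unit
    v[1:] = beta * v[1:] / norms
    v[0] = rng.uniform(-beta, beta, size=p)
    return v
```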
Choices
Stopping Condition (avoid overtraining!)

Set aside some of the training pairs as a validation set.
Stop training when the error on the validation
set stops decreasing.
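One way this stopping rule might look in code; epoch_fn and val_error_fn are assumed helper functions (one training epoch, and the error on the validation set), not anything defined on the slides:

```python
def train_with_early_stopping(train_pairs, val_pairs, v, w,
                              epoch_fn, val_error_fn, max_epochs=10000):
    """Train until the error on the validation set stops decreasing."""
    best_error = float("inf")
    for _ in range(max_epochs):
        epoch_fn(train_pairs, v, w)            # one epoch of weight updates
        error = val_error_fn(val_pairs, v, w)  # error on the held-out pairs
        if error >= best_error:                # validation error stopped decreasing
            break
        best_error = error
    return v, w
```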
Choices
Number of Training Pairs

total number of weights / desired average error on test set

where the average error on the training pairs is half of the above desired average
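Reading the rule above as P ≈ W / e (P training pairs, W total weights, e desired average test error), a small worked example with made-up numbers:

```python
weights = 80                      # W: total number of weights (illustrative)
desired_test_error = 0.1          # e: desired average error on the test set
training_pairs = weights / desired_test_error     # P ~ 800 training pairs
target_training_error = desired_test_error / 2    # train until error ~ 0.05
```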
Choices
Data Representation
Bipolar is better than binary because 0 units don't learn.
Discrete values: red, green, blue?
Continuous values: [15.0 .. 35.0]? (one possible encoding is sketched after this slide)
Number of Hidden Layers
1 is sufficient
Sometimes, multiple layers might speed up
the learning
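One possible bipolar encoding of the two representation questions above (the 1-of-3 code and the linear rescaling are assumptions, not prescriptions from the slides):

```python
# Discrete values: a 1-of-3 bipolar code.
COLORS = {"red": [1, -1, -1], "green": [-1, 1, -1], "blue": [-1, -1, 1]}

def scale_to_bipolar(value, lo=15.0, hi=35.0):
    """Map a continuous value in [15.0 .. 35.0] linearly onto [-1, 1]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0
```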
Example
XOR.
Bipolar data representation.
Bipolar sigmoid activation function.
a = 1
3 input units, 5 hidden units, 1 output unit
Initial Weights are all 0.
Training example: (1, -1). Target: 1.
Example
4. z1 = f(1*0 + 1*0 + -1*0) = f(0) = 0
z2 = z3 = z4 = 0
5. y1 = f(1*0 + 0*0 + 0*0 + 0*0 + 0*0) = f(0) = 0
6. error(y1) = (1 - 0) * [0.5 * (1 + 0) * (1 - 0)] = 0.5
7. error(z1) = 0 * f'(zin.1) = 0 = error(z2) = error(z3) = error(z4)
Example
8. w01(new) = w01(old) + a*error(y1)*z0
= 0 + 1 * 0.5 * 1 = 0.5

v21(new) = v21(old) + a*error(z1)*x2
= 0 + 1 * 0 * -1 = 0.
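A small NumPy check of this worked example (zero weights, bipolar sigmoid, input (1, -1), target 1, a = 1; the array layout matches the earlier sketches):

```python
import numpy as np

f = lambda x: -1.0 + 2.0 / (1.0 + np.exp(-x))
f_prime = lambda x: 0.5 * (1.0 + f(x)) * (1.0 - f(x))

x1 = np.array([1.0, 1.0, -1.0])            # bias, x1, x2
v = np.zeros((3, 4))                       # input -> hidden weights, all 0
w = np.zeros((5, 1))                       # hidden -> output weights, all 0

z_in = x1 @ v                              # step 4: all zeros, so z = f(0) = 0
z = f(z_in)
y_in = np.concatenate(([1.0], z)) @ w      # step 5: y = f(0) = 0
y = f(y_in)
err_y = (1.0 - y) * f_prime(y_in)          # step 6: 1 * 0.5 = 0.5
err_z = (err_y @ w[1:].T) * f_prime(z_in)  # step 7: all zeros
w01_new = 0.0 + 1.0 * err_y[0] * 1.0       # step 8: 0.5; every v update is 0
```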
Exercise
Draw the updated neural network.

Present the example (1, -1) as an example to classify. How is it classified now?

If learning were to occur, how would the network weights change this time?
XOR Experiments
Binary Activation/Binary Representation:
3000 epochs.
Bipolar Activation/Bipolar Representation:
400 epochs.
Bipolar Activation/Modified Bipolar
Representation [-0.8 .. 0.8]: 265 epochs.
Above experiment with Nguyen-Widrow
weight initialization: 125 epochs.
Variations
Momentum

Δwjk(t+1) = a * error(yj) * zk + μ * Δwjk(t)

Δvij(t+1) = similar
μ is in [0.0 .. 1.0]

The previous experiment takes 38 epochs.
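A small sketch of the momentum update as written above (mu = 0.9 is just an illustrative choice from [0.0 .. 1.0]; grad_term stands for the error(yj) * zk product):

```python
def momentum_update(w, grad_term, dw_prev, a=1.0, mu=0.9):
    """One momentum step: delta_w(t+1) = a * grad_term + mu * delta_w(t).
    grad_term is the error(yj) * zk term (e.g. np.outer(z1, err_y));
    dw_prev is the weight change from the previous step."""
    dw = a * grad_term + mu * dw_prev
    return w + dw, dw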


Variations
Batch update the weights to smooth the
changes.
Adapt the learning rate. For example, in
the delta-bar-delta procedure each weight
has its own learning rate that varies over
time.
2 consecutive weight increases or decreases
will increase the learning rate.
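A simplified per-weight adaptation in the spirit of delta-bar-delta (the constants kappa and phi are assumed, and the full procedure also tracks an exponentially weighted average of past gradients, which is omitted here):

```python
import numpy as np

def adapt_learning_rates(rates, grad, prev_grad, kappa=0.01, phi=0.1):
    """If a weight's gradient keeps the same sign on consecutive steps,
    grow its learning rate additively; if the sign flips, shrink it
    multiplicatively."""
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, rates + kappa, rates * (1.0 - phi))
```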
Variations
Alternate Activation Functions
Strictly Local Backpropagation
makes the algorithm more biologically
plausible by making all computations local
cortical units sum their inputs
synaptic units apply an activation function
thalamic units compute errors
equivalent to standard backpropagation
Variations
Strictly Local Backpropagation

input cortical layer -> input synaptic layer -> hidden cortical layer -> hidden synaptic layer -> output cortical layer -> output synaptic layer -> output thalamic layer

Number of Hidden Layers


Hecht-Nielsen Theorem

Given any continuous function f: I^n -> R^m, where I is [0, 1], f can be represented exactly by a feedforward network having n input units, 2n + 1 hidden units, and m output units.
