
Artificial Neural Network (ANN)

Back Propagation Algorithm

Subha Fernando
Dr.Eng, M.Eng, B.Sc(Special)Hons.

Department of Computational Mathematics


University of Moratuwa

December 11, 2020

Multilayer Perceptron Model

Multilayer Perceptron Algorithm


Multilayer Perceptron (MLP) is used to describe any general feed-forward network.
An MLP consists of an input layer, one or more hidden layers and an output layer.
The network is trained with the highly popular error back-propagation algorithm.
There are two passes through the different layers of the network:
Forward pass (inputs are passed from the input layer to the output layer)
Backward pass (errors are passed from the output layer back to the input layer).

Let's solve the XOR problem with 2 layers.


For that, consider a network with 2 inputs, 1 hidden layer (of 2 neurons) and 1 output layer.
Assume that the activation function is the threshold function.
Exercise 1: Initial parameters are:
W11 = W12 = W21 = W22 = W32 = +1, W31 = −2, b1 = −1.5, b2 = b3 = −0.5
Exercise 2: Initial parameters are:
W11 = W12 = W21 = W22 = W32 = +1, W31 = 1, b1 = 1.5, b2 = 0.5, b3 = −0.5

XOR Problem with 2 layers

Wij is the weight associated with the connection from the jth neuron to the ith neuron.
The activation function is
φ(v) = 1 if v ≥ 0, and 0 if v < 0.

X1  X2  XOR
0   0   0
0   1   1
1   0   1
1   1   0
Weight matrix of the input layer neurons and hidden layer neurons (with the Exercise 1 parameters; matrix rows are separated by semicolons):

W = [ b1 b2 ; w11 w12 ; w21 w22 ] = [ −1.5 −0.5 ; 1 1 ; 1 1 ]

Let's consider the input X = [x1; x2] = [0; 0].


Augmenting the input with the bias signal, X1 = [1; x1; x2] = [1; 0; 0]:

Y1H = Φ( W^T × X1 ) = Φ( [ −1.5 1 1 ; −0.5 1 1 ] × [ 1; 0; 0 ] ) = Φ( [ −1.5; −0.5 ] ) = [ 0; 0 ] = [ y1H; y2H ]

Weight matrix of the hidden layer neurons and the output layer neuron:

Wo = [ b3 ; w31 ; w32 ] = [ −0.5 ; −2 ; 1 ]

Y1O = Φ( Wo^T × [ 1; y1H; y2H ] ) = Φ( [ −0.5 −2 1 ] × [ 1; 0; 0 ] ) = Φ( −0.5 ) = 0

Similarly, by classifying the other inputs, we can show that the XOR problem can be solved by the multilayer perceptron.
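A minimal sketch of this forward pass in code, assuming the Exercise 1 parameters and the threshold activation above (the function and variable names are mine; NumPy is used only for the matrix products):

import numpy as np

# Threshold activation, applied element-wise: 1 if v >= 0, else 0
def step(v):
    return (np.asarray(v) >= 0).astype(int)

# Exercise 1 parameters; each row is [bias, weight from x1, weight from x2]
W_hidden = np.array([[-1.5, 1.0, 1.0],   # hidden neuron 1
                     [-0.5, 1.0, 1.0]])  # hidden neuron 2
W_output = np.array([-0.5, -2.0, 1.0])   # [b3, w31, w32]

def forward(x1, x2):
    x_aug = np.array([1.0, x1, x2])      # augmented input [1, x1, x2]
    y_h = step(W_hidden @ x_aug)         # hidden layer outputs
    return int(step(W_output @ np.concatenate(([1.0], y_h))))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), forward(x1, x2))     # prints 0, 1, 1, 0 (XOR)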

Back Propagation Algorithm

BP Algorithm General Procedure

The activation function is the sigmoid function:
Φ(v) = 1 / (1 + exp(−v))
The first derivative of the function is:
Φ′(v) = Φ(v)(1 − Φ(v))

Do the forward pass, i.e. calculate the output of each neuron i using
yi = Φ(W^T × X)
Calculate the local gradients for the neurons, i.e. δi
Adjust the weights of the network using the learning rule, i.e. Wij.
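The sigmoid and its derivative translate directly into code; a small sketch (the helper names are mine):

import numpy as np

def sigmoid(v):
    # Phi(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    # Phi'(v) = Phi(v) * (1 - Phi(v))
    s = sigmoid(v)
    return s * (1.0 - s)

def neuron_output(w, x):
    # y_i = Phi(w^T x), where w and x are the augmented weight and input vectors
    return sigmoid(np.dot(w, x))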


How to Calculate Local Gradients

δo1 = Φ′((1 × b3) + (y1 × w31) + (y2 × w32)) × (d3 − y3)
δo2 = Φ′((1 × b4) + (y1 × w41) + (y2 × w42)) × (d4 − y4)
δh1 = Φ′((1 × b1) + (x1 × w11) + (x2 × w12)) × ((δo1 × w31) + (δo2 × w41))
δh2 = Φ′((1 × b2) + (x1 × w21) + (x2 × w22)) × ((δo1 × w32) + (δo2 × w42))
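A sketch of these four gradient formulas for the 2-input, 2-hidden, 2-output network of this slide; the variable names mirror the slide, while the input, target and weight values are purely illustrative assumptions:

import numpy as np

def phi(v):       return 1.0 / (1.0 + np.exp(-v))   # sigmoid
def phi_prime(v): return phi(v) * (1.0 - phi(v))    # its derivative

# Assumed example values (illustrative only, not from the slides)
x1, x2 = 0.0, 1.0
b1 = b2 = b3 = b4 = 0.1
w11 = w12 = w21 = w22 = w31 = w32 = w41 = w42 = 0.5
d3, d4 = 1.0, 0.0

# Forward pass (induced local fields and outputs)
v1 = 1*b1 + x1*w11 + x2*w12;  y1 = phi(v1)
v2 = 1*b2 + x1*w21 + x2*w22;  y2 = phi(v2)
v3 = 1*b3 + y1*w31 + y2*w32;  y3 = phi(v3)
v4 = 1*b4 + y1*w41 + y2*w42;  y4 = phi(v4)

# Local gradients exactly as on the slide
delta_o1 = phi_prime(v3) * (d3 - y3)
delta_o2 = phi_prime(v4) * (d4 - y4)
delta_h1 = phi_prime(v1) * (delta_o1*w31 + delta_o2*w41)
delta_h2 = phi_prime(v2) * (delta_o1*w32 + delta_o2*w42)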


How to Adjust the Weights of the Network Using the Learning Rule

For Output Neurons:
Woi(n + 1) = Woi(n) + αWoi(n − 1) + ηδoi(n)y
W31(n + 1) = W31(n) + αW31(n − 1) + ηδo1(n) × y1
W32(n + 1) = W32(n) + αW32(n − 1) + ηδo1(n) × y2
W41(n + 1) = W41(n) + αW41(n − 1) + ηδo2(n) × y1
W42(n + 1) = W42(n) + αW42(n − 1) + ηδo2(n) × y2

For Hidden Neurons:
Wij(n + 1) = Wij(n) + αWij(n − 1) + ηδhi(n) × xj
W11(n + 1) = W11(n) + αW11(n − 1) + ηδh1(n) × x1
W12(n + 1) = W12(n) + αW12(n − 1) + ηδh1(n) × x2
W21(n + 1) = W21(n) + αW21(n − 1) + ηδh2(n) × x1
W22(n + 1) = W22(n) + αW22(n − 1) + ηδh2(n) × x2

For Bias Terms:
bi(n + 1) = bi(n) + αbi(n − 1) + ηδi(n) × 1
b3(n + 1) = b3(n) + αb3(n − 1) + ηδo1(n) × 1
b4(n + 1) = b4(n) + αb4(n − 1) + ηδo2(n) × 1
b1(n + 1) = b1(n) + αb1(n − 1) + ηδh1(n) × 1
b2(n + 1) = b2(n) + αb2(n − 1) + ηδh2(n) × 1
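All of these updates share one form, so a single helper covers them; a sketch, with η and α as assumed example values:

# Generic update used for every weight and bias on this slide:
# W(n+1) = W(n) + alpha * W(n-1) + eta * delta * input_signal
def update(w_now, w_prev, delta, signal, eta=0.25, alpha=0.0001):
    return w_now + alpha * w_prev + eta * delta * signal

# e.g. hidden-to-output weight w31 (input signal is y1),
#      hidden bias b1 (input signal is the constant 1):
# w31_next = update(w31, w31_prev, delta_o1, y1)
# b1_next  = update(b1,  b1_prev,  delta_h1, 1.0)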

BP Example
Output Calculations:
v1 = 1 × b1 + x1 × w11 + x2 × w12 ;  y1 = Φ(v1)
v2 = 1 × b2 + x1 × w21 + x2 × w22 ;  y2 = Φ(v2)
v3 = 1 × b3 + y1 × w31 + y2 × w32 ;  y3 = Φ(v3)
Therefore e3 = d3 − y3. In order to reduce the error, the error is back-propagated and the weight matrix is updated.
Gradient Calculations:
δo1 = Φ′(v3) × e3 = Φ(v3)(1 − Φ(v3)) × e3
δh1 = Φ′(v1) × (δo1 × w31)
δh2 = Φ′(v2) × (δo1 × w32)
Weight Calculations:
d3 = 0.9, η = 0.25 and α = 0.0001. At the first step take w31(0) = w31(1):
w31(2) = w31(1) + α × w31(0) + ηδo1(1) × y1
.....................
Draw the updated network after the first training step.
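A sketch of this first training step for the 2-2-1 network, with d3 = 0.9, η = 0.25 and α = 0.0001 as on the slide; the training input and the initial weights are not fixed by the slide, so the values below are assumptions:

import numpy as np

def phi(v):       return 1.0 / (1.0 + np.exp(-v))
def phi_prime(v): return phi(v) * (1.0 - phi(v))

eta, alpha, d3 = 0.25, 0.0001, 0.9
x1, x2 = 1.0, 0.0                        # assumed training input
# Assumed initial weights/biases (illustrative only)
b1, b2, b3 = -1.5, -0.5, -0.5
w11, w12, w21, w22 = 1.0, 1.0, 1.0, 1.0
w31, w32 = -2.0, 1.0

# ---- forward pass ----
v1 = b1 + x1*w11 + x2*w12;  y1 = phi(v1)
v2 = b2 + x1*w21 + x2*w22;  y2 = phi(v2)
v3 = b3 + y1*w31 + y2*w32;  y3 = phi(v3)
e3 = d3 - y3

# ---- local gradients ----
delta_o1 = phi_prime(v3) * e3
delta_h1 = phi_prime(v1) * delta_o1 * w31
delta_h2 = phi_prime(v2) * delta_o1 * w32

# ---- weight updates; at the first step W(0) = W(1), as on the slide ----
upd = lambda w, w_prev, delta, s: w + alpha*w_prev + eta*delta*s
w31, w32 = upd(w31, w31, delta_o1, y1), upd(w32, w32, delta_o1, y2)
b3       = upd(b3,  b3,  delta_o1, 1.0)
w11, w12 = upd(w11, w11, delta_h1, x1), upd(w12, w12, delta_h1, x2)
w21, w22 = upd(w21, w21, delta_h2, x1), upd(w22, w22, delta_h2, x2)
b1, b2   = upd(b1,  b1,  delta_h1, 1.0), upd(b2, b2, delta_h2, 1.0)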


Back Propagation Algorithm- Theory


In the Single Layer Perceptron algorithm, we used gradient descent on the error
function to find the correct weights: ∆w(n) = η[d − y]X(n), i.e.
∆wij(n) = η[di − yi]Xj(n) (see the sketch after this list).
We see that errors/updates are local to the node, i.e. the change in the weight
from node j to output i (wij) is controlled by the input that travels along the
connection and the error signal at output i.
But the problem is: how do we calculate the weight changes for the hidden layer, where
the desired output is known only for the output layer neurons?
Back propagation has two phases:
Forward pass phase: computes the 'functional signal', the feed-forward propagation of the input
pattern signals through the network.
Backward pass phase: computes the 'error signal', propagating the error backwards through the
network starting at the output units (where the error is the difference between the actual and
desired output values).
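For contrast, the single-layer update above is one line of code (a sketch; the function name is mine):

import numpy as np

def slp_update(W, x, d, y, eta=0.1):
    # W[i, j] is the weight from input j to output i; x, d, y are vectors.
    # delta_w_ij = eta * (d_i - y_i) * x_j for every output i and input j.
    return W + eta * np.outer(d - y, x)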


Back Propagation Algorithm- Theory

Consider a multi-layer network with one hidden layer, where nodes in the input layer are
indexed by i, nodes in the hidden layer are indexed by j, and nodes in the
output layer are indexed by k.
Then the weights between the input and hidden layers are written wji, and the
weights between the hidden and output layer neurons are written wkj.
If the gradient descent approach is used to update the weights, the
objective of learning is to modify the weight matrices so as to reduce the sum of
squared errors E = Σk (dk − yk)²


Back Propagation Algorithm- Theory


The error signal at the output of neuron k at iteration n is defined by
ek(n) = dk(n) − yk(n) −−−− (1)
Therefore the error energy for neuron k is
E = ½ ek²(n)
Thus the total error energy over all neurons in the output layer is
E = Σk ½ ek²(n) −−−− (2)
The induced local field vk(n) produced at the input of the activation function
associated with neuron k is therefore
vk(n) = Σj wkj(n) yj(n) (where the sum includes the bias term j = 0) −−−− (3)
yk(n) = φ(vk(n)) −−−− (4)

The back propagation algorithm applies a correction ∆wkj to the synaptic weight
which is proportional to the partial derivative ∂E(n)/∂wkj(n).

∂E(n)/∂wkj(n) = ∂E(n)/∂ek(n) · ∂ek(n)/∂wkj(n)
∂E(n)/∂wkj(n) = ∂E(n)/∂ek(n) · ∂ek(n)/∂yk(n) · ∂yk(n)/∂wkj(n)
∂E(n)/∂wkj(n) = ∂E(n)/∂ek(n) · ∂ek(n)/∂yk(n) · ∂yk(n)/∂vk(n) · ∂vk(n)/∂wkj(n)

Back Propagation Algorithm- Theory


∂E(n)/∂wkj(n) = ∂E(n)/∂ek(n) · ∂ek(n)/∂yk(n) · ∂yk(n)/∂vk(n) · ∂vk(n)/∂wkj(n)

From (2), E = Σk ½ ek²(n), therefore ∂E(n)/∂ek(n) = ek(n) −−− (5)
From (1), ek(n) = dk(n) − yk(n), therefore ∂ek(n)/∂yk(n) = −1 −−− (6)
From (4), yk(n) = φ(vk(n)), therefore ∂yk(n)/∂vk(n) = φ′(vk(n)) −−− (7)
From (3), vk(n) = Σj wkj(n)yj(n), therefore ∂vk(n)/∂wkj(n) = yj(n) −−− (8)

∂E(n)/∂wkj(n) = −ek(n) φ′(vk(n)) yj(n)

The correction ∆wkj(n) applied to wkj(n) is defined by the delta rule:
∆wkj(n) = −η ∂E(n)/∂wkj(n), where η is the learning rate parameter of the back propagation
algorithm (the minus sign is used to seek a direction for the weight change that reduces
the value of E(n)).
∆wkj(n) = η δk(n) yj(n)
where δk(n) = ek(n) φ′(vk(n))
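This result is easy to verify numerically; the sketch below compares the analytic gradient −ek φ′(vk) yj with a finite-difference estimate for one sigmoid output neuron, using assumed values for the hidden outputs, weights and target:

import numpy as np

def phi(v): return 1.0 / (1.0 + np.exp(-v))

y_hidden = np.array([1.0, 0.3, 0.7])    # [1, y1, y2]: bias signal plus hidden outputs (assumed)
w_k      = np.array([0.2, -0.4, 0.5])   # [b_k, w_k1, w_k2] (assumed)
d_k      = 0.9                          # desired output (assumed)

def E(w):
    # error energy 0.5 * e_k^2 for this single output neuron
    return 0.5 * (d_k - phi(w @ y_hidden))**2

# Analytic gradient: dE/dw_kj = -e_k * phi'(v_k) * y_j
v_k = w_k @ y_hidden
e_k = d_k - phi(v_k)
grad_analytic = -e_k * phi(v_k) * (1 - phi(v_k)) * y_hidden

# Central finite-difference estimate
eps = 1e-6
grad_numeric = np.array([(E(w_k + eps*np.eye(3)[j]) - E(w_k - eps*np.eye(3)[j])) / (2*eps)
                         for j in range(3)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-8))   # expect True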


Back Propagation Algorithm- Theory


For δk(n) = ek(n)φ′(vk(n)), we can show that δk(n) = −∂E(n)/∂vk(n).
For hidden neurons:
δj(n) = −∂E(n)/∂yj(n) · ∂yj(n)/∂vj(n)
Differentiating (2), E = Σk ½ ek²(n), with respect to the functional signal yj(n), we get
∂E/∂yj(n) = Σk ek(n) · ∂ek(n)/∂yj(n)
Again using the chain rule,
∂E/∂yj(n) = Σk ek(n) · ∂ek(n)/∂vk(n) · ∂vk(n)/∂yj(n)
From (1), ek(n) = dk(n) − yk(n) with yk(n) = φ(vk(n)), therefore ∂ek(n)/∂vk(n) = −φ′(vk(n))
From (3), vk(n) = Σj wkj(n)yj(n), therefore ∂vk(n)/∂yj(n) = wkj(n)
Now substitute these values into ∂E(n)/∂yj(n) = Σk ek(n) · ∂ek(n)/∂vk(n) · ∂vk(n)/∂yj(n):
∂E/∂yj(n) = −Σk ek(n) φ′(vk(n)) wkj(n), and yj(n) = φ(vj(n)) so ∂yj(n)/∂vj(n) = φ′(vj(n))
So, δj(n) = −∂E(n)/∂yj(n) · ∂yj(n)/∂vj(n) = φ′(vj(n)) Σk δk(n) wkj(n)


Back Propagation Algorithm- Summary


The local gradient for an output neuron is:
δk(n) = ek(n)φ′(vk(n))

The weight updating rule for an output neuron is:
wkj(n + 1) = wkj(n) + αwkj(n − 1) + ηδk(n)yj(n)

The local gradient for a hidden neuron is:
δj(n) = φ′(vj(n)) Σk δk(n)wkj(n)

The weight updating rule for a hidden neuron is:
wji(n + 1) = wji(n) + αwji(n − 1) + ηδj(n)yi(n)
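A sketch of a complete training loop built from these four rules, applied to XOR with sigmoid units; the initial weights, η and α are assumptions, and note that the momentum term here uses the conventional α∆w(n − 1) rather than the αw(n − 1) written on the slides:

import numpy as np

def phi(v):       return 1.0 / (1.0 + np.exp(-v))
def phi_prime(v): return phi(v) * (1.0 - phi(v))

rng = np.random.default_rng(1)
# 2-2-1 network; the first column/entry of each weight array is the bias
W_h = rng.uniform(-1, 1, (2, 3))       # hidden layer: rows are [b, w from x1, w from x2]
W_o = rng.uniform(-1, 1, 3)            # output neuron: [b, w from h1, w from h2]
dW_h_prev = np.zeros_like(W_h)         # previous weight changes, for the momentum term
dW_o_prev = np.zeros_like(W_o)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([0, 1, 1, 0], dtype=float)
eta, alpha = 0.5, 0.9                  # assumed learning rate and momentum

for epoch in range(10000):
    for x, d in zip(X, D):
        x_aug = np.concatenate(([1.0], x))
        v_h = W_h @ x_aug;  y_h = phi(v_h)                 # forward pass
        h_aug = np.concatenate(([1.0], y_h))
        v_o = W_o @ h_aug;  y_o = phi(v_o)

        delta_o = (d - y_o) * phi_prime(v_o)               # output local gradient
        delta_h = phi_prime(v_h) * delta_o * W_o[1:]       # hidden local gradients

        dW_o = alpha * dW_o_prev + eta * delta_o * h_aug   # weight changes with momentum
        dW_h = alpha * dW_h_prev + eta * np.outer(delta_h, x_aug)
        W_o, W_h = W_o + dW_o, W_h + dW_h
        dW_o_prev, dW_h_prev = dW_o, dW_h

for x in X:
    y_h = phi(W_h @ np.concatenate(([1.0], x)))
    print(x, round(float(phi(W_o @ np.concatenate(([1.0], y_h)))), 3))  # typically close to 0, 1, 1, 0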


Properties of BP
Advantages and Limitations of BP
The back propagation algorithm applies a correction ∆wij(n) to the synaptic
weight wij(n) which is proportional to ∂Error/∂wij.

Advantages:
Relatively simple implementation
A standard method that generally works well
Limitations:
Slow and inefficient
It can get stuck in local minima, resulting in sub-optimal solutions


Properties of BP - Momentum α

Adds a percentage of the last weight movement to the current movement.
Useful for getting over small bumps in the error function (i.e. it smooths out the
descent path by preventing extreme changes in the gradients due to local anomalies).
Often finds a minimum in fewer steps.

The Effect of the Learning Rate η

The learning rate coefficient determines the size of the weight adjustments made at
each iteration and hence influences the rate of convergence.
A poor choice of the coefficient can result in a failure to converge.
If the learning rate is too large, the search path will oscillate and converge more
slowly than a direct descent.
If the coefficient is too small, the descent will progress in small steps, significantly
increasing the time to converge.
BP with Minibatch Learning

Minibatch Learning
Noise Structure
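The slide gives only the headings, so the following is a generic sketch of how the per-pattern updates above become a minibatch update: local gradients are computed for each pattern in a small batch and averaged before the weights change. Smaller batches give noisier but cheaper gradient estimates, which is presumably what "Noise Structure" refers to here.

import numpy as np

def phi(v): return 1.0 / (1.0 + np.exp(-v))

def minibatch_step(W_h, W_o, X_batch, D_batch, eta=0.1):
    # One minibatch update for a 2-2-1 sigmoid network: the per-pattern gradients
    # are accumulated and averaged before the weights are changed.
    gW_h = np.zeros_like(W_h)
    gW_o = np.zeros_like(W_o)
    for x, d in zip(X_batch, D_batch):
        x_aug = np.concatenate(([1.0], x))
        y_h = phi(W_h @ x_aug)
        h_aug = np.concatenate(([1.0], y_h))
        y_o = phi(W_o @ h_aug)
        delta_o = (d - y_o) * y_o * (1.0 - y_o)          # output local gradient
        delta_h = y_h * (1.0 - y_h) * delta_o * W_o[1:]  # hidden local gradients
        gW_o += delta_o * h_aug
        gW_h += np.outer(delta_h, x_aug)
    n = len(X_batch)
    return W_h + eta * gW_h / n, W_o + eta * gW_o / n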
