Neural Networks - III: ICT3212 - Introduction to Intelligent Systems | COM3303 - Artificial Intelligence
K.A.S.H. Kulathilake
Ph.D., M.Phil., MCS, B.Sc.(Hons.)IT, SEDA(UK)
Senior Lecturer
Department of Computing
Rajarata University of Sri Lanka
kule@as.rjt.ac.lk
Learning Outcomes
• Students should be able to:
• Explain the Backpropagation Algorithm
• Apply the Backpropagation algorithm for Neural Networks
• Explain the Gradient Descent Algorithm
• Identify the vanishing gradient problem
• Describe the ways to overcome the vanishing gradient problem
Contents
• Backpropagation Algorithm
• Backpropagation
• Vanishing Gradient Problem
• Overcome Vanishing Gradient Problem
• Types of Gradient Descent Algorithms
Backpropagation Algorithm
• In forward propagation, we propagate the input through the network, layer by layer, to produce an output; backpropagation then works backwards from the resulting error to adjust the weights and biases.
Backpropagation Algorithm (Cont..)
• Why?
• The goal of backpropagation is to adjust the weights and biases throughout
the neural network based on the calculated loss (error) so that the loss
(error) will be lower in the next iteration.
• Ultimately, we want to find a minimum value for the loss function.
• How?
• The adjustment works by finding the gradient of the loss function through
the chain rule of calculus.
Backpropagation Algorithm (Cont..)
• Parameter Initialization
• Here, the parameters, i.e., the weights and biases associated with each artificial
neuron, are randomly initialized.
• After receiving the input, the network feeds the input forward, combining it with
the weights and biases to produce an output.
• The output associated with those random values is most probably not correct.
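As a concrete sketch of this step, the following randomly initializes one layer's parameters (the layer sizes, the Gaussian scale, and the zero-bias choice are illustrative assumptions, not taken from the slides):

```python
import random

def init_parameters(n_in, n_out, scale=0.01, seed=0):
    """Randomly initialize weights and biases for one fully-connected layer."""
    rng = random.Random(seed)
    # Small random weights, one row per output neuron.
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    # Biases commonly start at zero (one per output neuron).
    B = [0.0 for _ in range(n_out)]
    return W, B

W, B = init_parameters(n_in=4, n_out=1)  # e.g. 4 inputs feeding 1 neuron
```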
Backpropagation Algorithm (Cont..)
• Forward Propagation
• After initialization, when the input is given to the input layer, it is propagated
into the hidden units at each layer.
• The nodes here do their job without being aware of whether the results produced
are accurate or not (i.e., they do not re-adjust according to the results produced).
• Finally, the output is produced at the output layer:
𝑍 = 𝑊𝑋 + 𝐵
𝑌′ = 𝐴 = 𝜃(𝑍)
• This is called feedforward propagation.
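The feedforward equations above can be sketched as follows (using the sigmoid as an example choice of θ; the weights and inputs are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, B, X):
    """One layer: Z = W*X + B, then A = theta(Z) with theta = sigmoid."""
    Z = [sum(w * x for w, x in zip(row, X)) + b for row, b in zip(W, B)]
    A = [sigmoid(z) for z in Z]
    return Z, A

# One neuron with three inputs: Z = 0.5*1 + (-0.25)*2 + 0.1*3 + 0.05 = 0.35
Z, A = forward([[0.5, -0.25, 0.1]], [0.05], [1.0, 2.0, 3.0])
```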
Backpropagation Algorithm (Cont..)
• Error Detection
• An error function, E(X, Θ), defines the error between the desired (actual)
output Y and the calculated (predicted) output Y′ of the neural network on
input xi, for a set of input-output pairs (xi, yi) ∈ X.
• The error is determined through the loss function, denoted ℒ:
ℒ = 𝑌 − 𝐴   (actual output − predicted output)
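The slide's definition (actual output minus predicted output) is a simple element-wise difference, sketched below:

```python
def per_example_error(y_actual, y_predicted):
    """Loss as defined on the slide: actual output minus predicted output."""
    return y_actual - y_predicted

# Over a batch, the error is computed element-wise on the output values.
errors = [per_example_error(y, yp) for y, yp in zip([1.0, 0.0], [0.8, 0.3])]
```

In practice this raw difference is usually wrapped in a squared-error or cross-entropy loss (the later slides use ℒ = −y log a), but the element-wise difference is what this slide defines.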
Backpropagation Algorithm (Cont..)
• Backpropagation – Calculus: Derivative
• With calculus, we can calculate how much the value of
one variable changes depending on the change in
another variable.
• If we want to find out how a change in a variable x by
the fraction dx affects a related variable y, we can use
calculus to do that.
• The change dx in x would change y by dy.
• In Calculus notation, we express this relationship as
follows.
𝜕𝑦/𝜕𝑥
• This is known as the derivative of y with respect to x.
Backpropagation Algorithm (Cont..)
• Backpropagation
• The principle behind the backpropagation
algorithm is to adjust the randomly allocated
weights and biases so as to reduce the error,
such that the network produces the correct output.
• The system is trained in the supervised learning
method, where the error between the system’s
output and a known expected output is
presented to the system and used to modify its
internal state.
• We perform this by calculating the gradient of L
with respect to W in the Neural Network model.
Backpropagation Algorithm (Cont..)
𝐵 = 𝐵 − α · 𝜕𝐿/𝜕𝐵
𝑊 = 𝑊 − α · 𝜕𝐿/𝜕𝑊
• Weight Update – Learning Rate, Why?
• Subtracting the gradient as-is from the weight would likely result in a step that is too big.
• Before subtracting, we therefore multiply the derivative by a small value α called the
learning rate.
• Without the learning rate, the weights change too quickly and the network
will not learn properly.
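The update rule above, sketched for a single scalar parameter (the quadratic loss L(w) = w² is an illustrative stand-in):

```python
def gradient_step(param, grad, lr=0.1):
    """One gradient-descent update: subtract the gradient scaled by alpha (lr)."""
    return param - lr * grad

# Minimize L(w) = w**2, whose gradient is 2w: small repeated steps
# move w toward the minimum at 0.
w = 2.0
for _ in range(5):
    w = gradient_step(w, 2 * w, lr=0.1)   # each step multiplies w by 0.8
# Without the learning rate (lr = 1.0), w would jump from 2.0 to -2.0 and oscillate.
```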
Backpropagation Algorithm (Cont..)
• Iterations
• We repeat this process for many iterations
(epochs) until we find a local minimum.
• With each epoch, the model moves the
weights according to the gradient to find
the best weights.
• Note that this is loss optimization for a
particular example in our training dataset.
Types of Gradient Descent Algorithms
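Gradient descent variants are commonly distinguished by how much of the training data is used per parameter update: batch (all samples per step), stochastic (one sample per step), and mini-batch (a small subset per step). A minimal sketch, assuming a one-parameter model y = w·x with squared error (all names and values illustrative):

```python
import random

def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 w.r.t. w for one sample."""
    return 2 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x, so the optimum is w = 2
lr = 0.02

# Batch gradient descent: average the gradient over ALL samples per step.
w_batch = 0.0
for _ in range(200):
    g = sum(grad(w_batch, x, y) for x, y in data) / len(data)
    w_batch -= lr * g

# Stochastic gradient descent: update after EACH randomly chosen sample.
w_sgd = 0.0
rng = random.Random(0)
for _ in range(200):
    x, y = rng.choice(data)
    w_sgd -= lr * grad(w_sgd, x, y)

# Mini-batch gradient descent: update on a small random subset (size 2 here).
w_mb = 0.0
for _ in range(200):
    batch = rng.sample(data, 2)
    g = sum(grad(w_mb, x, y) for x, y in batch) / len(batch)
    w_mb -= lr * g
```

All three converge to roughly w = 2; they differ in the noise of each step and in the cost per update.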
Backpropagation – Chain Rule: Ex 01
First, we discuss some examples that implement the derivatives of the loss with
respect to weights and biases.

𝐹 = (𝑋 + 𝑌)𝑍, with intermediate 𝑈 = 𝑋 + 𝑌, so 𝐹 = 𝑈𝑍.

Let X = 1, Y = 2, and Z = 3. In forward propagation, U = 3 and F = 9.

In backpropagation we have to find 𝜕𝐹/𝜕𝑋, 𝜕𝐹/𝜕𝑌, and 𝜕𝐹/𝜕𝑍. Applying the chain rule:
𝜕𝐹/𝜕𝑋 = (𝜕𝐹/𝜕𝑈)·(𝜕𝑈/𝜕𝑋) = 𝑍·1 = 3
𝜕𝐹/𝜕𝑌 = (𝜕𝐹/𝜕𝑈)·(𝜕𝑈/𝜕𝑌) = 𝑍·1 = 3
𝜕𝐹/𝜕𝑍 = 𝑈 = 3
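Example 01 can be checked directly in code; the variable names mirror the computational graph above:

```python
# Forward pass for F = (X + Y) * Z with intermediate U = X + Y.
X, Y, Z = 1.0, 2.0, 3.0
U = X + Y           # U = 3
F = U * Z           # F = 9

# Backward pass via the chain rule.
dF_dU = Z                # F = U*Z
dF_dX = dF_dU * 1.0      # dU/dX = 1, so dF/dX = Z = 3
dF_dY = dF_dU * 1.0      # dU/dY = 1, so dF/dY = Z = 3
dF_dZ = U                # dF/dZ = U = 3
```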
Backpropagation – Chain Rule: Ex 02
𝐹 = 5(𝑋𝑌 + 𝑍), with intermediates 𝑈 = 𝑋𝑌 and 𝑉 = 𝑈 + 𝑍, so 𝐹 = 5𝑉.

Applying the chain rule:
𝜕𝐹/𝜕𝑋 = (𝜕𝐹/𝜕𝑉)·(𝜕𝑉/𝜕𝑈)·(𝜕𝑈/𝜕𝑋) = 5·1·𝑌 = 5𝑌
𝜕𝐹/𝜕𝑌 = (𝜕𝐹/𝜕𝑉)·(𝜕𝑉/𝜕𝑈)·(𝜕𝑈/𝜕𝑌) = 5·1·𝑋 = 5𝑋
𝜕𝐹/𝜕𝑍 = (𝜕𝐹/𝜕𝑉)·(𝜕𝑉/𝜕𝑍) = 5·1 = 5
Backpropagation – Chain Rule: Ex 03
𝐹 = 8[(𝑋 + 𝑌)(𝑌 + 𝑍) + 𝑌𝑍]

Intermediates: 𝑈 = 𝑋 + 𝑌, 𝑉 = 𝑌 + 𝑍, 𝑊 = 𝑌𝑍, 𝐴 = 𝑈𝑉, 𝐵 = 𝐴 + 𝑊, 𝐹 = 8𝐵.

𝜕𝐹/𝜕𝑋 = (𝜕𝐹/𝜕𝐵)·(𝜕𝐵/𝜕𝐴)·(𝜕𝐴/𝜕𝑈)·(𝜕𝑈/𝜕𝑋) = 8·1·𝑉·1 = 8𝑉 = 8(𝑌 + 𝑍)

𝜕𝐹/𝜕𝑌 = (𝜕𝐹/𝜕𝐵)·(𝜕𝐵/𝜕𝐴)·(𝜕𝐴/𝜕𝑈)·(𝜕𝑈/𝜕𝑌) + (𝜕𝐹/𝜕𝐵)·(𝜕𝐵/𝜕𝐴)·(𝜕𝐴/𝜕𝑉)·(𝜕𝑉/𝜕𝑌) + (𝜕𝐹/𝜕𝐵)·(𝜕𝐵/𝜕𝑊)·(𝜕𝑊/𝜕𝑌)
= 8·1·𝑉·1 + 8·1·𝑈·1 + 8·1·𝑍
= 8(𝑉 + 𝑈 + 𝑍) = 8[(𝑌 + 𝑍) + (𝑋 + 𝑌) + 𝑍] = 8(𝑋 + 2𝑌 + 2𝑍)

𝜕𝐹/𝜕𝑍 = (𝜕𝐹/𝜕𝐵)·(𝜕𝐵/𝜕𝐴)·(𝜕𝐴/𝜕𝑉)·(𝜕𝑉/𝜕𝑍) + (𝜕𝐹/𝜕𝐵)·(𝜕𝐵/𝜕𝑊)·(𝜕𝑊/𝜕𝑍)
= 8·1·𝑈·1 + 8·1·𝑌 = 8(𝑈 + 𝑌) = 8(𝑋 + 2𝑌)
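Example 03's chain-rule results can be cross-checked numerically with central finite differences (an independent check, not a method from the slides):

```python
def F(x, y, z):
    # F = 8[(X + Y)(Y + Z) + Y*Z]
    return 8.0 * ((x + y) * (y + z) + y * z)

def numerical_grad(f, x, y, z, h=1e-6):
    """Central-difference approximation of the partial derivatives of f."""
    gx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)
    gy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)
    gz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)
    return gx, gy, gz

x, y, z = 1.0, 2.0, 3.0
gx, gy, gz = numerical_grad(F, x, y, z)
# Chain-rule answers: dF/dX = 8(y+z) = 40, dF/dY = 8(x+2y+2z) = 88, dF/dZ = 8(x+2y) = 40
```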
Backpropagation – Single Node Neural Network
Single-node neuron, computational graph, and forward propagation.

Computational graph: 𝑋 → (𝑍[1], 𝐴[1]) → 𝐿(𝐴[1], 𝑌)

Forward propagation:
𝑍[1] = 𝑊[1]𝑋 + 𝐵[1]
𝑌′ = 𝐴[1] = 𝜃(𝑍[1])

where, for a neuron with four inputs X1(i) … X4(i),
𝑊[1] = [W11[1] W21[1] W31[1] W41[1]]
𝐵[1] = [b1[1]]
𝐴[1] = [a1[1]]

Loss: 𝐿 = −𝑦 log 𝑦′ = −𝑦 log 𝑎[1]

i = 1, 2, 3, …, m (m = number of training samples)
Backpropagation – Single Node Neural Network (Cont…)
Backpropagation: gradient with respect to 𝐵[1]

𝜕𝐿/𝜕𝐵[1] = (𝜕𝐿/𝜕𝑦′)·(𝜕𝑦′/𝜕𝑍[1])·(𝜕𝑍[1]/𝜕𝐵[1]) = (𝜕𝐿/𝜕𝑎[1])·(𝜕𝑎[1]/𝜕𝑍[1])·(𝜕𝑍[1]/𝜕𝐵[1])

With
𝑍[1] = 𝑊[1]𝑋 + 𝐵[1]
𝑦′ = 𝐴[1] = 𝜃(𝑍[1]) = 𝑎[1]
Loss 𝐿 = −𝑦 log 𝑦′ = −𝑦 log 𝐴[1] = −𝑦 log 𝑎[1]

the three factors are:
𝜕𝐿/𝜕𝑎[1] = 𝜕(−𝑦 log 𝑎[1])/𝜕𝑎[1] = −𝑦/𝑎[1]
𝜕𝑎[1]/𝜕𝑍[1] = 𝜕𝜃(𝑍[1])/𝜕𝑍[1] = 𝜃′(𝑍[1])
𝜕𝑍[1]/𝜕𝐵[1] = 𝜕(𝑊[1]𝑋 + 𝐵[1])/𝜕𝐵[1] = 1

Therefore:
𝜕𝐿/𝜕𝐵[1] = 𝑑𝐵[1] = (−𝑦/𝑎[1]) ⊙ 𝜃′(𝑍[1])   (all 1×m matrices)

Here ⊙ denotes Hadamard (element-wise) matrix multiplication, e.g.
[x y z] ⊙ [u v w] = [xu yv zw]
with division likewise performed element-wise.
Backpropagation – Single Node Neural Network (Cont…)
Backpropagation: gradient with respect to 𝑊[1]

𝜕𝐿/𝜕𝑊[1] = 𝑑𝑊[1] = (𝜕𝐿/𝜕𝑦′)·(𝜕𝑦′/𝜕𝑍[1])·(𝜕𝑍[1]/𝜕𝑊[1]) = (𝜕𝐿/𝜕𝑎[1])·(𝜕𝑎[1]/𝜕𝑍[1])·(𝜕𝑍[1]/𝜕𝑊[1])

The first two factors are as before:
(𝜕𝐿/𝜕𝑎[1])·(𝜕𝑎[1]/𝜕𝑍[1]) = (−𝑦/𝑎[1]) ⊙ 𝜃′(𝑍[1])

and, since 𝑍[1] = 𝑊[1]𝑋 + 𝐵[1],
𝜕𝑍[1]/𝜕𝑊[1] = 𝜕(𝑊[1]𝑋 + 𝐵[1])/𝜕𝑊[1] = 𝑋

Therefore:
𝜕𝐿/𝜕𝑊[1] = 𝑑𝑊[1] = (−𝑦/𝑎[1]) ⊙ 𝜃′(𝑍[1]) · 𝑋   (Hadamard products over 1×m matrices)
Backpropagation – Single Node Neural Network (Cont…)
Backpropagation summary:
𝜕𝐿/𝜕𝐵[1] = 𝑑𝐵[1] = (−𝑦/𝑎[1]) ⊙ 𝜃′(𝑍[1])
𝜕𝐿/𝜕𝑊[1] = 𝑑𝑊[1] = (−𝑦/𝑎[1]) ⊙ 𝜃′(𝑍[1]) · 𝑋

Weight update:
𝐵[1] = 𝐵[1] − α · 𝜕𝐿/𝜕𝐵[1]
𝑊[1] = 𝑊[1] − α · 𝜕𝐿/𝜕𝑊[1]

with forward pass 𝑍[1] = 𝑊[1]𝑋 + 𝐵[1], 𝑦′ = 𝐴[1] = 𝜃(𝑍[1]) = 𝑎[1], and loss 𝐿 = −𝑦 log 𝑎[1].
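Putting the single-node derivation together in code (sigmoid for θ, loss L = −y·log a, one training sample; the weights and inputs are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def single_node_backprop(W, B, X, y):
    """Forward + backward pass for one sigmoid neuron with L = -y*log(a)."""
    Z = sum(w * x for w, x in zip(W, X)) + B    # Z = W.X + B
    a = sigmoid(Z)                              # a = theta(Z)
    # dL/dB = (dL/da)*(da/dZ)*(dZ/dB) = (-y/a)*theta'(Z)*1,
    # where theta'(Z) = a*(1 - a) for the sigmoid.
    dB = (-y / a) * a * (1.0 - a)
    dW = [dB * x for x in X]                    # dZ/dW_i = X_i
    return a, dW, dB

# With Z = 0.3*1 - 0.2*2 + 0.1 = 0, a = 0.5 and y = 1: dB = -0.5, dW = [-0.5, -1.0]
a, dW, dB = single_node_backprop([0.3, -0.2], 0.1, [1.0, 2.0], y=1.0)
```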
Backpropagation – Generalized Two Layer Network
[Network diagram: inputs x(i) with biases b[1](i) feed a hidden layer producing (Z[1], A[1]), which feeds an output layer producing (Z[2], A[2]) and predictions Y′; the layers are connected by the weight matrices W[1] and W[2].]
Backpropagation – Generalized Two Layer Network (Cont…)
Computational graph of the 2-layer neural network:
𝑋 → (𝑍[1], 𝐴[1]) → (𝑍[2], 𝐴[2]) → 𝑌′, with parameters 𝑊[1], 𝐵[1] and 𝑊[2], 𝐵[2].

Forward propagation summary:

Layer 1 (p nodes):
𝑍[1] = 𝑊[1]𝑋 + 𝐵[1]   (p×m = (p×n)(n×m) + p×m)
𝐴[1] = 𝜃(𝑍[1])   (p×m)

Layer 2 (q nodes):
𝑍[2] = 𝑊[2]𝐴[1] + 𝐵[2]
𝐴[2] = 𝜃(𝑍[2])   (q×m)
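The two-layer forward pass above, sketched for a single input column (sigmoid for θ; the layer sizes p = 2, q = 1 and all weights are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, B, A_prev):
    """Z = W*A_prev + B, A = theta(Z) for one layer (lists stand in for matrices)."""
    Z = [sum(w * a for w, a in zip(row, A_prev)) + b for row, b in zip(W, B)]
    return [sigmoid(z) for z in Z]

# Layer 1: p = 2 nodes over n = 3 inputs; Layer 2: q = 1 node over the 2 hidden units.
W1, B1 = [[0.1, 0.2, -0.1], [0.05, -0.2, 0.3]], [0.0, 0.1]
W2, B2 = [[0.4, -0.3]], [0.05]

X = [1.0, 2.0, 3.0]
A1 = layer_forward(W1, B1, X)     # hidden activations A[1]
A2 = layer_forward(W2, B2, A1)    # network prediction Y' = A[2]
```

Note that layer 2 consumes A[1], not the raw input X, matching the computational graph.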
Backpropagation – Generalized Two Layer Network (Cont…)

With the loss
𝐿 = −𝑌 log 𝐴[2] − (1 − 𝑌) log(1 − 𝐴[2])

the gradient at the output layer is

𝜕𝐿/𝜕𝐴[2] = 𝑑𝐴[2] = 𝜕(−𝑌 log 𝐴[2] − (1 − 𝑌) log(1 − 𝐴[2]))/𝜕𝐴[2]
= −𝑌/𝐴[2] + (1 − 𝑌)/(1 − 𝐴[2])
= (𝐴[2] − 𝑌) / (𝐴[2](1 − 𝐴[2]))

All matrices here have dimension q×m, and the division is performed element-wise.

Recall the derivatives of logarithms used above:
d/dx logₐ(x) = 1/(x ln a)   (common logarithm)
d/dx ln(x) = 1/x   (natural logarithm)
Vanishing Gradient Problem
• As more layers using certain activation functions are added to a neural network, the
gradients of the loss function approach zero, making the network hard to train.
• Certain activation functions, like the sigmoid function, squash a large input space into a
small output space between 0 and 1.
• Therefore, even a large change in the input of the sigmoid function causes only a small change
in the output. Hence, the derivative becomes small.
Vanishing Gradient Problem (Cont…)
• Why it is Significant
• For a shallow network with only a few layers that use these activations, this isn't a big problem.
• However, when more layers are used, it can cause the gradient to be too small for training to
work effectively.
• Gradients of neural networks are found using backpropagation.
• Simply put, backpropagation finds the derivatives of the network by moving layer by layer
from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down the network (from the
final layer to the initial) to compute the derivatives of the initial layers.
• However, when n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to the initial layers.
• A small gradient means that the weights and biases of the initial layers will not be updated
effectively with each training session.
• Since these initial layers are often crucial to recognizing the core elements of the input data,
it can lead to overall inaccuracy of the whole network.
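The shrinking effect can be demonstrated numerically: even in the best case the sigmoid's derivative is only 0.25, so a product of ten of them (one per layer, via the chain rule) is already tiny:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # maximum value 0.25, attained at z = 0

# The chain rule across n layers multiplies n such derivatives together.
product = 1.0
for _ in range(10):
    product *= sigmoid_deriv(0.0)   # best case: 0.25 per layer
# product = 0.25**10, roughly 1e-6: the gradient reaching the
# initial layers has nearly vanished.
```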
Overcome Vanishing Gradient Problem
• The vanishing gradient
problem is caused by the
derivative of the activation
function used to create the
neural network.
• Instead of sigmoid, use an
activation function such as
Rectified Linear Units (ReLU).
• ReLU is an activation function that generates
a positive linear output when applied to
positive input values. If the input is negative,
the function returns zero.
𝑔(𝑥) = 𝑅𝑒𝐿𝑈(𝑥) = max(0, 𝑥)
• The derivative of ReLU is a step function.
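A sketch of ReLU and its step-function derivative (the value at exactly x = 0 is a convention; 0 is used here):

```python
def relu(x):
    """g(x) = max(0, x): linear for positive inputs, zero otherwise."""
    return max(0.0, x)

def relu_deriv(x):
    # Step function: slope 1 for x > 0, slope 0 for x < 0.
    return 1.0 if x > 0 else 0.0
```

Because the derivative is 1 on the positive side, gradients passed through active ReLU units are not shrunk layer after layer, which is why replacing the sigmoid with ReLU mitigates the vanishing gradient problem.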
Learning Outcomes - Revisit
• Now students should be able to:
• Explain the Backpropagation Algorithm
• Apply the Backpropagation algorithm for
Neural Networks
• Explain the Gradient Descent Algorithm
• Identify the vanishing gradient problem
• Describe the ways to overcome the vanishing
gradient problem
Thank you
Q&A
Next Session: Introduction to Deep Learning