
ICT3212 – Introduction to Intelligence Systems

COM3303 – Artificial Intelligence

Neural Networks - III

K.A.S.H. Kulathilake
Ph.D., M.Phil., MCS, B.Sc.(Hons.)IT, SEDA(UK)

Senior Lecturer
Department of Computing
Rajarata University of Sri Lanka
kule@as.rjt.ac.lk

1
Learning Outcomes

• At the end of this lecture students should be able to:


• Explain the Backpropagation Algorithm
• Apply the Backpropagation algorithm for Neural Networks
• Explain the Gradient Descent Algorithm
• Identify the vanishing gradient problem
• Describe the ways to overcome the vanishing gradient problem

2
Contents

• Backpropagation Algorithm
• Backpropagation
• Vanishing Gradient Problem
• Overcome Vanishing Gradient Problem
• Types of Gradient Descent Algorithms

3
Backpropagation Algorithm

• What?
• Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the loss function with respect to the weight values for the various inputs.
• The algorithm gets its name because the weights are updated backward, from output to input.

In forward propagation, we propagate the input to the output. In backpropagation, we transfer the information obtained from the output back to the input; the backpropagation process is the reverse of the forward propagation process. Backpropagation tunes W and B in the neural network.

4
Backpropagation Algorithm (Cont..)

• Why?
• The goal of backpropagation is to adjust the weights and biases throughout
the neural network based on the calculated loss (error) so that the loss
(error) will be lower in the next iteration.
• Ultimately, we want to find a minimum value for the loss function.

• How?
• The adjustment works by finding the gradient of the loss function through
the chain rule of calculus.

5
Backpropagation Algorithm (Cont..)

• Steps of Backpropagation Algorithm


• Parameter Initialization
• Forward Propagation
• Error Detection (Loss Function)
• Backpropagation
• Weight Update
• Iteration

6
Backpropagation Algorithm (Cont..)

• Parameter Initialization
• In this step, the parameters, i.e., the weights and biases associated with each artificial neuron, are randomly initialized.
• After receiving the input, the network feeds the input forward, combining it with the weights and biases to produce the output.
• The output associated with those random values is most probably not correct.
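A minimal sketch of this step (an illustration, not from the slides; the layer sizes, the small Gaussian weights, and the zero biases are assumptions reflecting one common convention):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def init_params(n_inputs, n_nodes):
    # Small random weights break symmetry between nodes; biases start at zero.
    W = rng.normal(scale=0.01, size=(n_nodes, n_inputs))
    B = np.zeros((n_nodes, 1))
    return W, B
```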

7
Backpropagation Algorithm (Cont..)

• Forward Propagation
• After initialization, when the input is given to the input layer, it propagates the input into the hidden units at each layer.
• The nodes here do their job without being aware of whether the results produced are accurate or not (i.e., they don't re-adjust according to the results produced).
• Then, finally, the output is produced at the output layer:

$Z = WX + B$
$Y' = A = \theta(Z)$

• This is called feedforward propagation.
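A rough sketch of this step (assuming $\theta$ is the sigmoid and X holds one training example per column; both are assumptions, since the slide leaves $\theta$ generic):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, W, B):
    # Z = WX + B, then A = theta(Z); theta is assumed to be the sigmoid here.
    Z = W @ X + B
    A = sigmoid(Z)
    return Z, A
```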

8
Backpropagation Algorithm (Cont..)

• Error Detection
• An error function, $E(X, \Theta)$, defines the error between the desired output $Y$ and the calculated output $Y'$ of the neural network on input $x_i$, for a set of input-output pairs $(x_i, y_i) \in X$ and a particular value of the parameters $\Theta$.
• The error is determined through the loss function, denoted as $\mathcal{L}$:

Loss function = actual output − predicted output
$\mathcal{L} = Y - A$

[Figure: a two-layer network mapping the input $X$ through $(Z^{[1]}, A^{[1]})$ and $(Z^{[2]}, A^{[2]})$, with parameters $W^{[1]}, B^{[1]}, W^{[2]}, B^{[2]}$, to the predicted output $A = A^{[2]}$ over q output nodes (q classes), compared against the actual output $Y$.]

9
Backpropagation Algorithm (Cont..)
• Backpropagation – Calculus: Derivative
• With calculus, we can calculate how much the value of one variable changes depending on the change in another variable.
• If we want to find out how a change in a variable x by the small amount dx affects a related variable y, we can use calculus to do that.
• The change dx in x would change y by dy.
• In calculus notation, we express this relationship as follows:

$dy = \frac{\partial y}{\partial x}\, dx$

• The quantity $\frac{\partial y}{\partial x}$ is known as the derivative of y with respect to x.

10
Backpropagation Algorithm (Cont..)

• Backpropagation – Calculus: Gradient


• The first derivative of a function gives you the slope of
that function at the evaluated coordinate.
• If you have functions with several variables, you can take
partial derivatives with respect to every variable and
stack them in a vector.
• This gives you a vector that contains the slopes with
respect to every variable.
• Collectively the slopes point in the direction of the
steepest ascent along the function.
• This vector is also known as the gradient of a function.
• Going in the direction of the negative gradient gives us
the direction of the steepest descent.
• Going down the route of the steepest descent, we will
eventually end up at a minimum value of the function.
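To make this concrete, here is a small illustration (not from the slides) that estimates a gradient by finite differences; the negative of the resulting vector is the steepest-descent direction:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Central-difference estimate of the gradient of f at x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# For f(x, y) = x^2 + y^2 the gradient points uphill,
# so -gradient is the direction of steepest descent.
f = lambda v: v[0] ** 2 + v[1] ** 2
print(numerical_gradient(f, np.array([1.0, 2.0])))  # ~[2. 4.]
```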

11
Backpropagation Algorithm (Cont..)

• Backpropagation
• The principle behind the backpropagation algorithm is to adjust the randomly allocated weights and biases so as to reduce the error, such that the network produces the correct output.
• The system is trained in the supervised learning
method, where the error between the system’s
output and a known expected output is
presented to the system and used to modify its
internal state.
• We perform this by calculating the gradient of L
with respect to W in the Neural Network model.
12
Backpropagation Algorithm (Cont..)

• Weight Update – Motivation for Gradient Descent Algorithm


• Once we have calculated the gradient of L with respect to W, we can subtract it from the original weight $W_{old}$ to move in the direction of the minimum value of the loss function (error minimizing).
• But in a non-linear function, the gradient will be different at every point along the function.
• Therefore, it is difficult to calculate the gradient once and expect it to lead the learning model straight to the minimum value.
• Instead, it is required to take a very small step in the direction of the current gradient, recalculate the gradient at the new location, take a step in that direction, and repeat the process.

13
Backpropagation Algorithm (Cont..)

• Weight Update – Adjusting the weights with Gradient Descent Algorithm
• This method is the key to minimizing the loss function and achieving our target, which is to predict close to the original value.
• If we observe the loss function graph, we will see that it is basically parabolic, or convex, in shape: it has a specific global minimum which we need to find in order to find the minimum loss function value.
• So, we always try to use a loss function that is convex in shape in order to get a proper minimum.
• Now, we see that the predicted results depend on the weights, from the equation.

[Figure: Loss Function Graph]


14
Backpropagation Algorithm (Cont..)

• Weight Update – Loss Function Graph


• Initially, the model assigns random weights to the features. So, say it initializes the weight w = a.
• Initially, the model can generate a loss which is far from the minimum point Lmin.
• Now, we can see that if we move the weights more towards the positive x-axis we can optimize the loss function and achieve the minimum value.
• But, how will the machine know?
• We need to optimize the weight to minimize the error, so, obviously, we need to check how the error varies with the weights.
• To do this we need to find the derivative of the error with respect to the weight (the gradient).

[Figure: Loss Function Graph]
15
Backpropagation Algorithm (Cont..)
• Weight Update – Adjusting the weights with Gradient Descent Algorithm
• If the loss increases with an increase in weight, the gradient will be positive; we are basically at point C, where we can see this statement is true.
• If the loss decreases with an increase in weight, the gradient will be negative; point A corresponds to such a situation.
• Now, from point A we need to move towards the positive x-axis, and the gradient is negative.
• From point C, we need to move towards the negative x-axis, but the gradient is positive.
• So, the negative of the gradient always shows the direction along which the weights should be moved in order to optimize the loss function.
• So, in this way the gradient guides the model whether to increase or decrease the weights in order to optimize the loss function.

[Figure: Loss Function Graph]


16
Backpropagation Algorithm (Cont..)

• Weight Update – Adjusting the weights with Gradient Descent Algorithm
• The model found which way to move; now the model needs to find by how much it should move the weights.
• This is decided by a parameter called the Learning Rate, denoted by alpha (α).
• In the graph we see the weights are moved from point A to point B, which are at a distance of dx:

$dx = \alpha \frac{\partial L}{\partial W}$

• So, the distance to move is the product of the learning rate parameter α and the magnitude of change in error with a change in weight at that point (the gradient).

[Figure: Loss Function Graph]
17
Backpropagation Algorithm (Cont..)

• Weight Update – Adjusting the weights with Gradient Descent Algorithm
• We need to update the weights and bias such that we get the global loss minimum, according to the Gradient Descent Algorithm.
• New weights are obtained by subtracting the derivative of the loss with respect to the weights from the current weights ($W_{old}$), scaled by the learning rate α:

$W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W}$

• Similarly, the new bias is obtained by subtracting the derivative of the loss with respect to the bias from the current bias ($B_{old}$):

$B_{new} = B_{old} - \alpha \frac{\partial L}{\partial B}$
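In code, the update rule is a one-liner per parameter. A minimal sketch (the gradients dW and dB are assumed to have been computed by backpropagation):

```python
def gradient_descent_step(W, B, dW, dB, alpha):
    # W_new = W_old - alpha * dL/dW and B_new = B_old - alpha * dL/dB
    W = W - alpha * dW
    B = B - alpha * dB
    return W, B
```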
18
Backpropagation Algorithm (Cont..)

$W_{new} = W_{old} - \alpha \frac{\partial L}{\partial W} \qquad B_{new} = B_{old} - \alpha \frac{\partial L}{\partial B}$

• Weight Update – Learning Rate, Why?
• In fact, subtracting the gradient as-is from the weight will likely result in a step that is too big.
• Before subtracting, we therefore multiply the derivative by a small value α, called the learning rate.
• If we don't use this learning rate, the weights are manipulated too quickly and the network won't learn properly.

19
Backpropagation Algorithm (Cont..)

• Iterations
• It is required to repeat this process over many iterations, or epochs, until we find a local minimum.
• With each epoch, the model moves the weights according to the gradient to find the best weights.
• Now, this is a loss optimization for a particular example in our training dataset.

20
Types of Gradient Descent Algorithms

• In general, the training dataset contains thousands of examples.
• Therefore, it would take a very long time to find the optimal weights using all of them at once.
• Experiments have shown that if we optimize on only one sample of our training set, the weight optimization is good enough for the whole dataset.
• So, depending upon the method used, we have different types of gradient descent mechanisms:
• Stochastic Gradient Descent
• Batch Gradient Descent
• Mini-Batch Gradient Descent
21
Types of Gradient Descent Algorithms (Cont..)

• Stochastic Gradient Descent (SGD)


• When we train the model to optimize the loss function using only one particular example from our dataset at a time, it is called Stochastic Gradient Descent.
• We do the following steps in one epoch for SGD (see the sketch after this list):
1. Take an example
2. Feed it to the Neural Network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
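A hedged sketch of one SGD epoch following these steps; the helpers `forward` and `backward` (returning the per-example gradients dW, dB) are assumed, as is X holding one example per column:

```python
def sgd_epoch(X, Y, W, B, alpha):
    for i in range(X.shape[1]):                # step 1: take one example
        x_i, y_i = X[:, i:i+1], Y[:, i:i+1]
        Z, A = forward(x_i, W, B)              # step 2: feed it forward
        dW, dB = backward(x_i, y_i, Z, A, W)   # step 3: its gradient
        W, B = W - alpha * dW, B - alpha * dB  # step 4: update the weights
    return W, B                                # steps 1-4 ran for every example
```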

22
Types of Gradient Descent Algorithms (Cont..)

• Stochastic Gradient Descent (SGD)


• Since we are considering just one example at a time, the cost will fluctuate over the training examples and it will not necessarily decrease.
• But in the long run, you will see the cost decreasing, with fluctuations.
• Also, because the cost fluctuates so much, it will never quite reach the minimum; it will keep dancing around it.
• SGD can be used for larger datasets.
• It converges faster when the dataset is large, as it updates the parameters more frequently.

23
Types of Gradient Descent Algorithms (Cont..)

• Batch Gradient Descent


• When we train the model to optimize the loss function using the mean of all the individual losses in our whole dataset, it is called Batch Gradient Descent.
• So that is just one step of gradient descent in one epoch.
• The graph of loss vs. epochs is also quite smooth, because we are averaging over all the gradients of the training data for a single step.
• The loss keeps on decreasing over the epochs.
• It takes a lot of time to train if the dataset is huge, and is therefore somewhat inefficient.

24
Types of Gradient Descent Algorithms (Cont..)

• Mini-Batch Gradient Descent


• According to SGD, the model is trained using only 1 example at a time.
• So, how well do you think a model will learn if it is shown only one example and told to learn about all other examples?
• Moreover, there is a possibility that the model may get too biased by the peculiarities of that particular example.
• Since in SGD we use only one example at a time, we cannot use a vectorized implementation.
• This can slow down the computations.
• To tackle this problem, a mixture of Batch Gradient Descent and SGD is used.

25
Types of Gradient Descent Algorithms (Cont..)

• Mini-Batch Gradient Descent


• We use a batch of a fixed number of training examples which is smaller than the actual dataset, and call it a mini-batch.
• Doing this helps us achieve the advantages of both the former variants we saw.
• So, after creating the mini-batches of fixed size, we do the following steps in one epoch (see the sketch after this list):
1. Pick a mini-batch
2. Feed it to the Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the mini-batches we created
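A corresponding sketch of one mini-batch epoch, under the same assumptions as the SGD sketch (`forward` and `backward` helpers, one example per column; here `backward` is assumed to return the mean gradient over its batch):

```python
import numpy as np

def minibatch_epoch(X, Y, W, B, alpha, batch_size=32):
    m = X.shape[1]
    idx = np.random.permutation(m)             # shuffle before batching
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]  # step 1: pick a mini-batch
        Xb, Yb = X[:, batch], Y[:, batch]
        Z, A = forward(Xb, W, B)               # step 2: feed it forward
        dW, dB = backward(Xb, Yb, Z, A, W)     # step 3: mean gradient
        W, B = W - alpha * dW, B - alpha * dB  # step 4: update the weights
    return W, B
```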

26
Types of Gradient Descent Algorithms (Cont..)

• Mini-Batch Gradient Descent


• Just like SGD, the average cost over the epochs in mini-batch gradient descent fluctuates, because we are averaging over a small number of examples at a time.
• So, when we use mini-batch gradient descent we update our parameters frequently, and we can also use a vectorized implementation for faster computations.

27
Backpropagation – Chain Rule: Ex 01
First, we discuss some examples that illustrate the derivations used to find the gradient of a loss with respect to weights and biases.

$F = (X + Y)Z$

Computational graph: X and Y feed into U = X + Y; U and Z feed into F = UZ.

Let X = 1, Y = 2, and Z = 3. In forward propagation, U = 3 and F = 9.
In backpropagation we have to find $\frac{\partial F}{\partial X}$, $\frac{\partial F}{\partial Y}$, and $\frac{\partial F}{\partial Z}$.
We know that U = X + Y and F = UZ. Apply the chain rule:

$\frac{\partial F}{\partial X} = \frac{\partial F}{\partial U} \cdot \frac{\partial U}{\partial X} = Z \qquad \frac{\partial F}{\partial Y} = \frac{\partial F}{\partial U} \cdot \frac{\partial U}{\partial Y} = Z \qquad \frac{\partial F}{\partial Z} = U$

28
Backpropagation – Chain Rule: Ex 02

$F = 5(XY + Z)$

Computational graph: X and Y feed into U = XY; U and Z feed into V = U + Z; then F = 5V.

U = XY, V = U + Z, and F = 5V.

$\frac{\partial F}{\partial X} = \frac{\partial F}{\partial V} \cdot \frac{\partial V}{\partial U} \cdot \frac{\partial U}{\partial X} = 5 \cdot 1 \cdot Y = 5Y$

$\frac{\partial F}{\partial Y} = \frac{\partial F}{\partial V} \cdot \frac{\partial V}{\partial U} \cdot \frac{\partial U}{\partial Y} = 5 \cdot 1 \cdot X = 5X$

$\frac{\partial F}{\partial Z} = \frac{\partial F}{\partial V} \cdot \frac{\partial V}{\partial Z} = 5 \cdot 1 = 5$
29
Backpropagation – Chain Rule: Ex 03

$F = 8[(X + Y)(Y + Z) + YZ]$

Computational graph: $U = X + Y$, $V = Y + Z$, $W = YZ$, $A = UV$, $B = A + W$, $F = 8B$.

$\frac{\partial F}{\partial X} = \frac{\partial F}{\partial B} \cdot \frac{\partial B}{\partial A} \cdot \frac{\partial A}{\partial U} \cdot \frac{\partial U}{\partial X} = 8 \cdot 1 \cdot V \cdot 1 = 8V = 8(Y + Z)$

$\frac{\partial F}{\partial Y} = \frac{\partial F}{\partial B} \cdot \frac{\partial B}{\partial A} \cdot \frac{\partial A}{\partial U} \cdot \frac{\partial U}{\partial Y} + \frac{\partial F}{\partial B} \cdot \frac{\partial B}{\partial A} \cdot \frac{\partial A}{\partial V} \cdot \frac{\partial V}{\partial Y} + \frac{\partial F}{\partial B} \cdot \frac{\partial B}{\partial W} \cdot \frac{\partial W}{\partial Y}$
$= 8 \cdot 1 \cdot V \cdot 1 + 8 \cdot 1 \cdot U \cdot 1 + 8 \cdot 1 \cdot Z = 8(V + U + Z) = 8[(Y + Z) + (X + Y) + Z] = 8(X + 2Y + 2Z)$

$\frac{\partial F}{\partial Z} = \frac{\partial F}{\partial B} \cdot \frac{\partial B}{\partial A} \cdot \frac{\partial A}{\partial V} \cdot \frac{\partial V}{\partial Z} + \frac{\partial F}{\partial B} \cdot \frac{\partial B}{\partial W} \cdot \frac{\partial W}{\partial Z} = 8 \cdot 1 \cdot U \cdot 1 + 8 \cdot 1 \cdot Y = 8(U + Y) = 8(X + 2Y)$
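The hand-derived results above can be sanity-checked numerically. This small illustration (not part of the slides) compares them against central finite differences at X = 1, Y = 2, Z = 3:

```python
def F(x, y, z):
    return 8 * ((x + y) * (y + z) + y * z)

x, y, z, eps = 1.0, 2.0, 3.0, 1e-6
dF_dx = (F(x + eps, y, z) - F(x - eps, y, z)) / (2 * eps)
dF_dy = (F(x, y + eps, z) - F(x, y - eps, z)) / (2 * eps)
dF_dz = (F(x, y, z + eps) - F(x, y, z - eps)) / (2 * eps)

print(dF_dx, 8 * (y + z))            # both ~40: 8(Y+Z)
print(dF_dy, 8 * (x + 2*y + 2*z))    # both ~88: 8(X+2Y+2Z)
print(dF_dz, 8 * (x + 2*y))          # both ~40: 8(X+2Y)
```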
30
Backpropagation – Single Node Neural Network

[Figure: a single-node neuron with inputs $x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, x_4^{(i)}$, weights $W_{11}^{[1]}, W_{21}^{[1]}, W_{31}^{[1]}, W_{41}^{[1]}$, bias $B_1^{[1]}$, computing $Z_1^{[1]}, A_1^{[1]}$ and producing the output Y'.]

Computational graph: $X \rightarrow (Z^{[1]}, A^{[1]}) \rightarrow L(A^{[1]}, Y)$, with parameters $W^{[1]}$ and $B^{[1]}$.

Forward Propagation:

$Z^{[1]} = W^{[1]}X + B^{[1]}$
$Y' = A^{[1]} = \theta(Z^{[1]})$

where $W^{[1]} = [W_{11}^{[1]} \; W_{21}^{[1]} \; W_{31}^{[1]} \; W_{41}^{[1]}]$, $B^{[1]} = [b_1^{[1]}]$, and $A^{[1]} = [a_1^{[1]}]$.

$Loss\; L = -y \log y' = -y \log a^{[1]}$, for training samples $i = 1, 2, 3, \ldots, m$, where m is the number of training samples.

31
Backpropagation – Single Node Neural Network (Cont…)

Recall the forward pass: $Z^{[1]} = W^{[1]}X + B^{[1]}$, $y' = A^{[1]} = \theta(Z^{[1]}) = a^{[1]}$, and $Loss\; L = -y \log y' = -y \log a^{[1]}$.

Backpropagation – gradient with respect to the bias:

$\frac{\partial L}{\partial B^{[1]}} = \frac{\partial L}{\partial y'} \cdot \frac{\partial y'}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial B^{[1]}} = \frac{\partial L}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial B^{[1]}}$

where

$\frac{\partial L}{\partial a^{[1]}} = \frac{\partial(-y \log a^{[1]})}{\partial a^{[1]}} = \frac{-y}{a^{[1]}}$

$\frac{\partial a^{[1]}}{\partial Z^{[1]}} = \frac{\partial \theta(Z^{[1]})}{\partial Z^{[1]}} = \theta'(Z^{[1]})$

$\frac{\partial Z^{[1]}}{\partial B^{[1]}} = \frac{\partial(W^{[1]}X + B^{[1]})}{\partial B^{[1]}} = 1$

Therefore

$\frac{\partial L}{\partial B^{[1]}} = dB^{[1]} = \frac{-y}{a^{[1]}} \odot \theta'(Z^{[1]})$

where $\odot$ denotes Hadamard (element-wise) matrix multiplication and all factors have dimension 1×m, e.g.

$\begin{bmatrix} x & y & z \\ a & b & c \end{bmatrix} \odot \begin{bmatrix} u & v & w \\ m & n & o \end{bmatrix} = \begin{bmatrix} xu & yv & zw \\ am & bn & co \end{bmatrix}$

32
Backpropagation – Single Node Neural Network (Cont…)

Backpropagation – gradient with respect to the weights:

$\frac{\partial L}{\partial W^{[1]}} = dW^{[1]} = \frac{\partial L}{\partial y'} \cdot \frac{\partial y'}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}} = \frac{\partial L}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}}$

From the previous slide, $\frac{\partial L}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial Z^{[1]}} = \frac{-y}{a^{[1]}} \odot \theta'(Z^{[1]})$, and

$\frac{\partial Z^{[1]}}{\partial W^{[1]}} = \frac{\partial(W^{[1]}X + B^{[1]})}{\partial W^{[1]}} = X$

Therefore

$\frac{\partial L}{\partial W^{[1]}} = dW^{[1]} = \frac{-y}{a^{[1]}} \odot \theta'(Z^{[1]}) \cdot X$

33
Backpropagation – Single Node Neural Network (Cont…)

Backpropagation summary:

$\frac{\partial L}{\partial B^{[1]}} = dB^{[1]} = \frac{-y}{a^{[1]}} \odot \theta'(Z^{[1]})$

$\frac{\partial L}{\partial W^{[1]}} = dW^{[1]} = \frac{-y}{a^{[1]}} \odot \theta'(Z^{[1]}) \cdot X$

Weight Update:

$B^{[1]} = B^{[1]} - \alpha \frac{\partial L}{\partial B^{[1]}}$

$W^{[1]} = W^{[1]} - \alpha \frac{\partial L}{\partial W^{[1]}}$
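A hedged sketch of the full single-node step derived above, assuming $\theta$ is the sigmoid (so $\theta'(Z) = A(1-A)$) and X is n×m with one example per column; the averaging over the m examples and the X transpose are shape details the slides leave implicit:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def single_node_step(X, y, W, B, alpha):
    m = X.shape[1]
    Z = W @ X + B                        # Z[1] = W[1]X + B[1]
    A = sigmoid(Z)                       # y' = A[1] = theta(Z[1])
    dZ = (-y / A) * (A * (1 - A))        # (-y/a) Hadamard theta'(Z[1])
    dW = (dZ @ X.T) / m                  # dL/dW, averaged over the m examples
    dB = dZ.mean(axis=1, keepdims=True)  # dL/dB
    W = W - alpha * dW                   # W[1] = W[1] - alpha * dL/dW
    B = B - alpha * dB                   # B[1] = B[1] - alpha * dL/dB
    return W, B
```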

34
Backpropagation – Generalized Two Layer Network (Cont…)

[Figure: a two-layer network with inputs $x_1^{(i)}, x_2^{(i)}$, biases $b^{[1]}, b^{[2]}$, weight matrices $W^{[1]}, W^{[2]}$, hidden-layer pairs $(Z^{[1]}, A^{[1]})$, output-layer pairs $(Z^{[2]}, A^{[2]})$, and predicted outputs $Y'_1, \ldots, Y'_q$.]

35
Backpropagation – Generalized Two Layer Network (Cont…)

Computational graph of the 2-layer neural network: $X \rightarrow (Z^{[1]}, A^{[1]}) \rightarrow (Z^{[2]}, A^{[2]}) \rightarrow Y'$, with parameters $W^{[1]}, B^{[1]}$ and $W^{[2]}, B^{[2]}$.

Forward Propagation: Summary

Layer 1 (p nodes):
$Z^{[1]} = W^{[1]}X + B^{[1]}$   with dimensions (p×m) = (p×n)(n×m) + (p×m)
$A^{[1]} = \theta(Z^{[1]})$   with dimensions (p×m)

Layer 2 (q nodes):
$Z^{[2]} = W^{[2]}A^{[1]} + B^{[2]}$   with dimensions (q×m) = (q×p)(p×m) + (q×m)
$A^{[2]} = \theta(Z^{[2]})$   with dimensions (q×m)
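A minimal sketch of this forward pass with the shapes written out ($\theta$ assumed sigmoid; the biases are stored as single columns and broadcast across the m examples, which is equivalent to the p×m and q×m shapes above):

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def two_layer_forward(X, W1, B1, W2, B2):
    # n inputs, p hidden nodes, q output nodes, m examples
    Z1 = W1 @ X + B1    # (p,m) = (p,n)(n,m) + broadcast (p,1)
    A1 = sigmoid(Z1)    # (p,m)
    Z2 = W2 @ A1 + B2   # (q,m) = (q,p)(p,m) + broadcast (q,1)
    A2 = sigmoid(Z2)    # (q,m)
    return Z1, A1, Z2, A2
```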
36
Backpropagation – Generalized Two Layer Network (Cont…)

With the cross-entropy loss

$L = -Y \log A^{[2]} - (1 - Y) \log(1 - A^{[2]})$

backpropagation starts from the gradient of the loss with respect to the output activation:

$\frac{\partial L}{\partial A^{[2]}} = dA^{[2]} = \frac{\partial(-Y \log A^{[2]} - (1 - Y)\log(1 - A^{[2]}))}{\partial A^{[2]}} = \frac{-Y}{A^{[2]}} + \frac{1 - Y}{1 - A^{[2]}} = \frac{A^{[2]} - Y}{A^{[2]}(1 - A^{[2]})}$

All matrices have q×m dimension, and the division is performed element-wise.

Derivative of the common logarithm: $\frac{d}{dx} \log_a(x) = \frac{1}{x \ln(a)}$

Derivative of the natural logarithm: $\frac{d}{dx} \ln(x) = \frac{1}{x}$
37
Vanishing Gradient Problem
• As more layers using certain activation functions are added to a neural network, the gradient of the loss function approaches zero, making the network hard to train.
• Certain activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1.
• Therefore, even a large change in the input of the sigmoid function can cause only a small change in the output. Hence, the derivative becomes small.

[Figure: a comparison between the sigmoid function itself (blue) and its derivative (red). The first derivative of the sigmoid is a bell curve with values ranging from 0 to 0.25.]
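This can be checked in a few lines (an illustration, not from the slides): the sigmoid derivative peaks at 0.25, so a chain of such factors shrinks exponentially with depth:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)     # derivative of the sigmoid: s(1 - s)

print(sigmoid_prime(0.0))  # 0.25, the largest value the derivative can take
print(0.25 ** 10)          # ~9.5e-07: the best case after 10 sigmoid layers
```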

38
Vanishing Gradient Problem (Cont…)
• Why it is Significant
• For a shallow network with only a few layers that use these activations, this isn't a big problem.
• However, when more layers are used, it can cause the gradient to be too small for training to work effectively.
• Gradients of neural networks are found using backpropagation.
• Simply put, backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial) to compute the derivatives of the initial layers.
• However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to the initial layers.
• A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training iteration.
• Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network.

39
Vanishing Gradient Problem (Cont…)

• Impact of the Vanishing Gradient Problem
• When it comes to deep networks, the vanishing gradient can have a significant impact on performance.
• The weights of the network remain unchanged as the derivative vanishes.
• During backpropagation, a neural network learns by updating its weights and biases to reduce the loss function.
• In a network with a vanishing gradient, the weights cannot be updated, so the network cannot learn.
• The performance of the network will decrease as a result.

40
Overcome Vanishing Gradient Problem
• The vanishing gradient problem is caused by the derivative of the activation function used to create the neural network.
• Instead of sigmoid, use an activation function such as the Rectified Linear Unit (ReLU).
• ReLU is an activation function that generates a positive linear output when applied to positive input values. If the input is negative, the function returns zero.

$g(x) = ReLU(x) = \max(0, x)$

The derivative of ReLU is a step function.

41
Overcome Vanishing Gradient Problem (Cont…)

• If the ReLU function is used for activation in a neural network in place of the sigmoid function, the partial derivative of the activation takes the value 0 or 1, so the backpropagated gradient is not repeatedly scaled down.
• The use of the ReLU function thus prevents the gradient from vanishing.
• The problem with the use of ReLU arises when the gradient has a value of 0.
• In such cases, the node is considered a dead node, since the old and new values of the weights remain the same.
• This situation can be avoided by the use of a leaky ReLU function, which prevents the gradient from falling to zero (see the sketch below).
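A minimal sketch of both activations and their derivatives (the 0.01 leaky slope is an assumption; it is a common but arbitrary choice):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)        # step function: exactly 0 or 1

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def leaky_relu_prime(x, slope=0.01):
    return np.where(x > 0, 1.0, slope)  # never zero, so no dead nodes
```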

42
Learning Outcomes - Revisit
• Now students should be able to:
• Explain the Backpropagation Algorithm
• Apply the Backpropagation algorithm for
Neural Networks
• Explain the Gradient Descent Algorithm
• Identify the vanishing gradient problem
• Describe the ways to overcome the vanishing gradient problem

43
Thank you
Q&A
Next Session: Introduction to Deep Learning

44
