A trained model consists of:
- The topology of the deep neural network (i.e., the layers and their interconnection)
- The learned parameters (i.e., the learned weights and biases)
The model depends on the hyperparameters, because the hyperparameters determine the learned parameters (weights and biases).
Error Landscape
[Figure: SSE (Sum Squared Error), Σ_i (t_i − z_i)², plotted against the weight values.]
Delta Learning Rule (Widrow-Hoff Rule)
• The goal is to decrease the overall error each time a weight is changed.
• The Total Sum Squared Error (SSE) is the objective function: E = Σ_i (t_i − z_i)² (a short code sketch of one update follows).
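As a quick illustration of the rule, here is a minimal Python/NumPy sketch of one delta-rule update for a single linear unit; the input x, target t, weights w and learning rate eta are illustrative values, not from the slides.

import numpy as np

# Delta (Widrow-Hoff) rule for a single linear unit: z = w . x
# One update step decreases the squared error (t - z)^2.
x = np.array([0.5, -1.0, 2.0])   # input pattern (illustrative)
t = 1.0                          # target output
w = np.zeros(3)                  # weights
eta = 0.1                        # learning rate

z = w @ x                        # unit output
w += eta * (t - z) * x           # delta rule: dw = eta * (t - z) * x

Repeating this step over the training patterns moves the weights down the error landscape and lowers the SSE objective E.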
[Figure: three-layer network — input layer (x_1 … x_n), hidden layer (z_1 … z_p) connected by weights w_ij, output layer (y_1 … y_m) connected by weights w_jk; error signals flow backward from the output layer.]
(3) During the third stage, i.e. back propagation of error, each output unit compares its computed activation y_k with the target value t_k to determine the associated error for that pattern with that unit.
- Based on this error, the factor δ_k (k = 1, 2, …, m) is computed and is used to distribute the error at output unit y_k back to all units in the previous layer.
- Similarly, the factor δ_j (j = 1, 2, …, p) is computed for each hidden unit z_j.
(4) During the fourth stage, the weights and biases are updated.
1. Initialization of Weights:
Step 1: Initialize the weights to small random values.
2. Feed Forward:
Step 2: Each input unit receives the input signal x_i and transmits this signal to all units in the hidden layer.
Step 3: Each hidden unit (z_j, j = 1, 2, …, p) sums its weighted input signals.
Step 4: Each output unit (y_k, k = 1, 2, …, m) sums its weighted input signals (a code sketch of these steps is given below).
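Putting the stages together, here is a minimal Python/NumPy sketch of one training step (feed forward, back propagation of error, weight update) for a single pattern; it assumes sigmoid activations, and the layer sizes, weight matrices V and W, targets and learning rate are illustrative choices, not from the slides.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, p, m = 4, 3, 2                       # input, hidden, output sizes (illustrative)
V = rng.normal(scale=0.1, size=(n, p))  # Step 1: small random input->hidden weights
W = rng.normal(scale=0.1, size=(p, m))  # Step 1: small random hidden->output weights
eta = 0.5

x = rng.random(n)                       # one input pattern
t = np.array([0.0, 1.0])                # its target

# Steps 2-4: feed forward
z = sigmoid(x @ V)                      # hidden activations z_j
y = sigmoid(z @ W)                      # output activations y_k

# Stage 3: back propagation of error (sigmoid derivative is y * (1 - y))
delta_k = (t - y) * y * (1 - y)         # error factor at each output unit
delta_j = (delta_k @ W.T) * z * (1 - z) # error factor at each hidden unit

# Stage 4: weight update
W += eta * np.outer(z, delta_k)
V += eta * np.outer(x, delta_j)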
Disadvantages of the Backpropagation Algorithm
• It is slow: all previous layers are locked until the gradients for the current layer are calculated.
• It suffers from the vanishing and exploding gradients problem.
• It suffers from overfitting and underfitting problems.
• It considers only the predicted value and the actual value to calculate the error and the gradients (related to the objective function, partially related to the backpropagation algorithm).
• It does not consider the spatial, associative and dis-associative relationships between classes while calculating errors (related to the objective function, partially related to the backpropagation algorithm).
• The network may get trapped in a local minimum even though there is a much deeper minimum nearby.
Optimizers in Deep Neural Network
[Figure: loss plotted against the value of a weight — gradient descent moves from the starting point down to the point of convergence, i.e. where the cost function is at its minimum.]
Gradient Descent
• Gradient descent is a way to minimize an objective
function 𝐽(𝜃)
• 𝐽(𝜃) : Objective function
• θ ∈ ℝ^d : Model's parameters
• 𝜂 : Learning rate. This determines the size of the steps we
take to reach a (local) minimum.
Update equation:
θ_new = θ − η · ∇θ J(θ)
[Figure: J(θ) plotted against θ (change in weight) — repeated updates move θ toward the local minimum θ*.]
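A minimal Python/NumPy sketch of this update rule on a toy quadratic objective; the objective J, its gradient, the starting point and the learning rate are illustrative stand-ins, not from the slides.

import numpy as np

def J(theta):            # toy objective (illustrative): minimum at theta = [1, -2]
    return np.sum((theta - np.array([1.0, -2.0])) ** 2)

def grad_J(theta):       # gradient of the toy objective
    return 2 * (theta - np.array([1.0, -2.0]))

theta = np.array([5.0, 5.0])
eta = 0.1
for _ in range(100):     # repeated updates move theta toward the minimum
    theta = theta - eta * grad_J(theta)
print(theta, J(theta))   # theta is close to [1, -2], J(theta) close to 0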
Stochastic Gradient Descent Learning Algorithm
Update equation:
θ = θ − η · ∇θ J(θ; x^(i), y^(i))
In batch gradient descent we need to calculate the gradients for the whole dataset to perform just one update; SGD instead performs an update for each single training example (x^(i), y^(i)).
Code:
Note : we shuffle the training data at every epoch
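A sketch of such an SGD loop in Python/NumPy, assuming a linear model with squared-error loss; the dataset X, y, the model and all constants are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # illustrative dataset
y = X @ np.array([2.0, -1.0, 0.5])        # targets from a known linear model
theta, eta = np.zeros(3), 0.01

for epoch in range(20):
    idx = rng.permutation(len(X))         # shuffle the training data every epoch
    for i in idx:
        x_i, y_i = X[i], y[i]
        grad = 2 * (x_i @ theta - y_i) * x_i   # gradient of (x_i.theta - y_i)^2
        theta = theta - eta * grad             # update from a single example
print(theta)                              # approaches [2, -1, 0.5]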
• Advantages of SGD:
- Frequent updates of the model parameters, hence it converges in less time.
- Requires less memory, as there is no need to store the loss values for the whole dataset.
- May find new minima.
• Disadvantages:
- High variance in the model parameters.
- May overshoot even after reaching the global minimum.
Stochastic Gradient Descent With Momentum
Given a sequence of values b_1, b_2, b_3, …, b_n, we create a running-average variable v (here with γ = 0.5 for illustration):
v_t1 = b_1
v_t2 = γ·v_t1 + b_2 = 0.5·b_1 + b_2
v_t3 = γ·v_t2 + b_3 = γ(γ·v_t1 + b_2) + b_3 = γ²·b_1 + γ·b_2 + b_3 = 0.25·b_1 + 0.5·b_2 + b_3
and so on: each v_t is an exponentially weighted average that gives more weight to recent values (a code sketch follows).
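Applying this running-average idea to the gradient steps gives SGD with momentum; below is a minimal Python/NumPy sketch with the common choice γ = 0.9 (the toy gradient function and constants are illustrative, not from the slides).

import numpy as np

def grad(theta):                        # illustrative gradient of a quadratic loss
    return 2 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, gamma = 0.1, 0.9                   # learning rate and momentum coefficient
for _ in range(200):
    v = gamma * v + eta * grad(theta)   # decaying average of past gradient steps
    theta = theta - v                   # momentum update
print(theta)                            # converges near 3.0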
Difference between Gradient Descent and Stochastic Gradient Descent Algorithms
1. Gradient Descent uses the whole training dataset for each update; SGD uses a single training sample.
2. GD is a slow and computationally expensive algorithm; SGD is faster and less computationally expensive than batch GD.
3. GD is not suggested for huge training sets; SGD can be used for large training sets.
4. GD is deterministic in nature; SGD is stochastic in nature.
5. GD gives the optimal solution; SGD gives a good solution, but not necessarily the optimal one.
6. GD requires no random shuffling of points; for SGD the data samples should be in a random order, which is why we shuffle the training set every epoch.
7. GD convergence is slow; SGD reaches convergence much faster.
8. GD can't escape shallow local minima easily; SGD can escape shallow local minima more easily.
Comparison of trade-offs of Gradient Descent Variants
[Table: trade-offs of the gradient descent variants for sparse data vs. dense data.]
Adaptive gradient (Adagrad) Optimizer
In GD, SGD and mini-batch SGD, the same learning rate is used: the weights change during training, but the learning rate remains the same for every parameter.
Idea of Adaptive Gradient (Adagrad): adapt the learning rate for each parameter based on its past gradients of the loss function:
θ_(t+1) = θ_t − (η / √(G_t + ε)) · g_t
where g_t is the gradient of the loss function, G_t accumulates the squares of the past gradients, and ε is a small constant added to avoid division by zero.
Adaptive Learning Rate
● The previous algorithms use a fixed learning rate throughout the learning process.
○ The learning rate therefore has to either be set very small at the beginning or be decreased periodically.
We know that, for a given single point (x, y), the gradients of w would be the following:
Motivation for Adaptive Learning Rate
• Previous methods (SGD): the same learning rate η is used for every parameter θ_i at every time step: θ_(t+1) = θ_t − η · g_t
• Adagrad: a different learning rate for every parameter θ_i at every time step t, based on G_t ∈ ℝ^(d×d), a diagonal matrix in which each diagonal element (i, i) is the sum of the squares of the gradients with respect to θ_i up to time step t.
Adagrad divides the learning rate by the square root of the sum of squares of gradients.
Adaptive Gradient Algorithm (Adagrad)
Per-parameter update:
θ_(t+1, i) = θ_(t, i) − (η / √(G_(t, ii) + ε)) · g_(t, i)
Vectorized update over all parameters (⊙ is the element-wise product):
θ_(t+1) = θ_t − (η / √(G_t + ε)) ⊙ g_t
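A minimal Python/NumPy sketch of this per-parameter Adagrad update; the toy gradient function and constants are illustrative, and eps plays the role of the small ε that avoids division by zero.

import numpy as np

def grad(theta):                          # illustrative gradient of a quadratic loss
    return 2 * (theta - np.array([1.0, -2.0]))

theta = np.array([5.0, 5.0])
G = np.zeros(2)                           # running sum of squared gradients, per parameter
eta, eps = 1.0, 1e-8
for _ in range(500):
    g = grad(theta)
    G += g ** 2                           # accumulate squared gradients
    theta -= eta / np.sqrt(G + eps) * g   # per-parameter adaptive step
print(theta)                              # approaches [1, -2]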
Advantages and Disadvantages of Adagrad
• Advantages:
• It is well suited for dealing with sparse data (data with many zero or missing entries).
• It greatly improves the robustness of SGD.
• It eliminates the need to manually tune the learning rate.
• Disadvantage:
• Its main weakness is the accumulation of the squared gradients in the denominator: the sum keeps growing during training, so the learning rate shrinks and eventually becomes vanishingly small.
Adadelta
• Instead of inefficiently storing all past squared gradients, the sum is recursively defined as a decaying average of all past squared gradients.
• We define the running average of squared gradients E[g²]_t at time step t as:
E[g²]_t = γ · E[g²]_(t−1) + (1 − γ) · g_t²
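A small Python/NumPy sketch contrasting Adagrad's ever-growing sum with this decaying average; the stream of random gradients and γ = 0.9 are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=1000)          # illustrative stream of gradients g_t

G_sum = 0.0                            # Adagrad: sum of all squared gradients
Eg2 = 0.0                              # Adadelta/RMSprop: decaying average of squared gradients
gamma = 0.9
for g in grads:
    G_sum += g ** 2
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2

print(G_sum)   # keeps growing with the number of steps
print(Eg2)     # stays on the scale of the recent squared gradients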
Adadelta
• Advantages:
-The learning rate does not decay and the
training does not stop.
• Disadvantages:
-Computationally expensive.
Root Mean Square Propagation
(RMSprop)
• Root Mean Square Propagation (RMSprop) is similar to momentum: it is a technique to damp out the motion along the y-axis and speed up gradient descent.
• For better understanding, let us denote the y-axis as the bias b and the x-axis as the weight W.
• It is called Root Mean Square because we square the derivatives of both the w and b parameters.
• RMSprop is a gradient-based optimization technique used in training neural networks.
• It was proposed by Geoffrey Hinton, often called the father of back-propagation, in his Coursera course on neural networks.
• Gradients of very complex functions like neural networks have a tendency to either vanish or explode as the data propagates through the function.
Root Mean Square Propagation (RMSprop)
• AdaGrad can result in a premature and excessive decrease of the learning rate.
• RMSprop modifies AdaGrad to perform better on non-convex surfaces.
• It changes the gradient accumulation into an exponentially decaying average of the squared gradients (a sketch follows).
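A minimal Python/NumPy sketch of the usual RMSprop formulation (decaying average of squared gradients, learning rate divided by its square root); the toy gradient function and constants are illustrative, not from the slides.

import numpy as np

def grad(theta):                          # illustrative gradient of a quadratic loss
    return 2 * (theta - np.array([1.0, -2.0]))

theta = np.array([5.0, 5.0])
Eg2 = np.zeros(2)                         # exponentially decaying average of squared gradients
eta, gamma, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta -= eta / np.sqrt(Eg2 + eps) * g # step scaled by the RMS of recent gradients
print(theta)                              # approaches [1, -2]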
Adam (Adaptive Moment Estimation)
Here, we take the bias-corrected means of m(t) and v(t) so that E[m(t)] is equal to E[g(t)] (and E[v(t)] to E[g(t)²]), where E[f(x)] is the expected value of f(x).
• To update the parameters:
θ_new = θ_old − (η / (√v̂_t + ε)) · m̂_t
where m̂_t and v̂_t are the bias-corrected first and second moment estimates.
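A minimal Python/NumPy sketch of the Adam update with the commonly used defaults β1 = 0.9 and β2 = 0.999, including the bias correction described above; the toy gradient function and constants are illustrative, not from the slides.

import numpy as np

def grad(theta):                          # illustrative gradient of a quadratic loss
    return 2 * (theta - np.array([1.0, -2.0]))

theta = np.array([5.0, 5.0])
m = np.zeros(2)                           # first moment estimate (mean of gradients)
v = np.zeros(2)                           # second moment estimate (mean of squared gradients)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction so E[m_hat] ~ E[g]
    v_hat = v / (1 - beta2 ** t)          # bias correction so E[v_hat] ~ E[g^2]
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)                              # approaches [1, -2]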
Nesterov Accelerated Gradient