
NEURAL NETWORKS & DEEP LEARNING

(21MCA24DB3)

Prepared & Presented By:


Dr. Balkishan
Assistant Professor
Department of Computer Science & Applications
Maharshi Dayanand University
Rohtak
Overfitting and Underfitting
• A model is said to be a good machine learning
model if it generalizes well to any new input data
from the problem domain.
• This allows us to make predictions on future data
that the model has never seen.
• To judge how well a machine learning model learns
and generalizes to new data, we examine overfitting
and underfitting, which are the two main causes of
poor performance in machine learning algorithms.
Underfitting
• A statistical model or a machine learning algorithm is said to have
underfitting when it cannot capture the underlying trend of the
data.
• Underfitting destroys the accuracy of our machine learning
model.
• Its occurrence simply means that our model or the algorithm
does not fit the data well enough.
• It usually happens when we have too little data to build an
accurate model.
• In such cases, the rules of the machine learning model are too
simple and flexible to be applied to such minimal data, and
therefore the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using more data and also increasing
the features. 
• Underfitting – high bias (the predicted values are far away from
the target values) and low variance (the predicted values are
close to each other)
 
Techniques to reduce underfitting: 
• Increase model complexity
• Increase the number of features, performing feature
engineering
• Remove noise from the data.
• Increase the number of epochs or increase the duration of
training to get better results.
Overfitting

• A statistical model is said to be overfitted when we train it
with a lot of features and data (just like fitting ourselves in
oversized pants).
• When a model gets trained with so much data, it starts
learning from the noise and inaccurate data entries in our
data set.
• Then the model does not categorize the data correctly, because
of too many details and noise.
• A solution to avoid overfitting:
-Remove features
-Early stopping
-Regularization
-Ensembling technique etc
• Overfitting – High variance and low bias 
• Techniques to reduce overfitting:
• Increase training data.
• Reduce model complexity.
• Early stopping during the training phase (monitor the
loss over the training period and stop training as soon
as the loss begins to increase); a minimal sketch follows.
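A minimal early-stopping sketch in Python (the loss values and the patience are illustrative; in practice the losses come from evaluating the model on the validation set each epoch):

    # Stop when the validation loss has not improved for `patience` consecutive epochs.
    val_losses = [0.90, 0.71, 0.60, 0.55, 0.54, 0.56, 0.58, 0.61]  # hypothetical values
    best_loss = float("inf")
    patience, bad_epochs = 2, 0

    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0   # improvement: reset the counter
        else:
            bad_epochs += 1                       # no improvement this epoch
        if bad_epochs >= patience:
            print(f"Stopping after epoch {epoch}, best loss {best_loss}")
            break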
Regularizing a Deep Network
(Technique to prevent overfitting)
• Regularization is a technique which makes
slight modifications to the learning algorithm
such that the model generalizes better.
• This in turn improves the model’s
performance on the unseen data.
• Reduce the complexity of the model
Regularization
• Regularization is a technique used to reduce the errors by fitting the function
appropriately on the given training set and avoid overfitting. 
The commonly used regularization techniques are : 
 
-L2 regularization
-L1 regularization
-Dropout regularization
- Early Stopping Regularization

• A regression model that uses the L2 regularization technique is called Ridge regression.
• A regression model that uses the L1 regularization technique is called LASSO (Least
Absolute Shrinkage and Selection Operator) regression.
• Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty
term to the loss function (L).
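Written out as a sketch (with L as the unregularized loss, wj the coefficients and λ the regularization strength):
  Ridge (L2): Loss = L + λ · Σ wj²
  Lasso (L1): Loss = L + λ · Σ |wj|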
L2 Regularization
[Equation (1): an example regression equation in which x is the independent variable, y is
the dependent variable, and 0.7, 1.2, 21 and 39 are the regression coefficients]
[Scaled-down version of equation (1): the same equation with the coefficients shrunk by
L2 regularization]
What is Ridge Regression?

• Ridge regression is a model tuning method that is used to
analyze any data that suffers from multicollinearity.
• This method performs L2 regularization.
• When the issue of multicollinearity occurs, the least-squares
estimates are unbiased but their variances are large, which
results in predicted values being far away from the actual values.
Important Observations
• In simple terms, the minimization objective = LS Obj + α (sum of
the square of coefficients)
• Where LS Obj is Least Square Objective that is the linear
regression objective without regularization.
• Here α is the tuning factor that controls the strength of the
penalty term.
• If α = 0, the objective becomes the same as simple linear regression,
so we get the same coefficients as simple linear regression.
• If α = ∞, the coefficients will be zero, because with infinite weightage
on the square of the coefficients anything other than zero would make
the objective infinite.
• If 0 < α < ∞, the magnitude of α decides the weightage given to the
different parts of the objective.
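A small numerical sketch of this objective in Python (X, y, the coefficients and the α values are made-up illustrative numbers):

    import numpy as np

    # Ridge objective: LS Obj + alpha * (sum of squared coefficients)
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
    y = np.array([3.0, 3.5, 7.0])
    coef = np.array([0.9, 1.1])

    def ridge_objective(coef, X, y, alpha):
        ls_obj = np.sum((y - X @ coef) ** 2)   # least-squares objective
        penalty = alpha * np.sum(coef ** 2)    # L2 penalty on the coefficients
        return ls_obj + penalty

    print(ridge_objective(coef, X, y, alpha=0.0))   # alpha = 0: plain linear regression objective
    print(ridge_objective(coef, X, y, alpha=10.0))  # larger alpha: heavier penalty on large coefficients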
L1 Regularization
Dropout Regularization
• Randomly selected neurons are ignored during each
training step.

• Dropped neurons don’t have effect on next layers.

• Dropped neurons are not updated in backward training.
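A minimal numpy sketch of (inverted) dropout during training (the activation values and keep probability are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    activations = np.array([0.5, 1.2, -0.3, 0.8, 0.1])   # outputs of one hidden layer
    keep_prob = 0.8                                       # drop roughly 20% of the neurons

    mask = rng.random(activations.shape) < keep_prob      # 1 = keep, 0 = drop
    dropped = activations * mask / keep_prob              # dropped neurons contribute nothing forward

    # Dropped neurons also receive no weight updates in the backward pass,
    # because their contribution (and hence their gradient) is zero for this step.
    print(mask, dropped)                                  # at test time, dropout is switched off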


Model Exploration and Hyper-Parameter Tuning

Model Parameters (learned during training)
• Model parameters are the entities learned via training from the
training data.
• They are not set manually by the designer.
• With respect to deep neural networks, the model parameters are:
 - Weights
 - Biases

Model Hyper-Parameters (control the parameters)
• These are parameters that govern (control) the determination of the model
parameters during training.
 - They are typically set manually via heuristics
 - They are tuned during a cross-validation phase
• Examples: learning rate, number of layers, number of units in each layer,
activation functions, and many others.
• What is a model?
• A model is described by its hyper-parameters, because the
hyper-parameters govern (control) the parameters of the network.
• Implicitly the model contains:
 - The topology of the deep neural network (i.e., the layers and their
   interconnection)
 - The learned parameters (i.e., the learned weights and biases)
• The model is dependent upon the hyper-parameters because the hyper-
parameters determine the learned parameters (weights and biases).

Hyper-parameters include:
-Learning Rate
-Number of Layers
-Number of Units in each Layer
-Activation Functions
-Etc.
Model Optimization

• To optimize the model (its inference-time behavior), a process
known as model selection is performed
• Model selection involves the selection of the hyper-parameters
that yield the best performance of the neural network
• The hyper parameters are tuned using an iterative process of
either:
-Validation
-Cross-Validation
• Many models may be evaluated during the validation/cross-
validation phase and the optimal model is selected
• The optimal model is then evaluated on the test dataset to
determine how well it performs on data never seen before
Training set-Validation Set and Test Set
• Training Set – Data set used to learn the optimal model
parameters (weights, biases)
• Validation (“Dev”) Set – Data set used to perform model
selection (tuning of hyper parameters)
• Used to estimate the generalization error of the training
allowing for the hyper parameters to be updated accordingly
• Test Set – Data set used to assess the fully trained model
• A fully trained model is the model that has been selected via
hyper parameter tuning and has been subsequently trained to
determine the optimal weights and biases (e.g., using back
propagation)
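A minimal sketch of such a split in Python (the 80/10/10 split and the random data are illustrative; a library utility such as scikit-learn's train_test_split could be used instead):

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))       # 1000 samples, 5 features (made-up data)
    y = rng.integers(0, 2, size=1000)    # binary targets

    idx = rng.permutation(len(X))        # shuffle before splitting
    n_train, n_val = 800, 100
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]

    X_train, y_train = X[train_idx], y[train_idx]  # used to learn weights and biases
    X_val, y_val = X[val_idx], y[val_idx]          # used to tune hyper-parameters
    X_test, y_test = X[test_idx], y[test_idx]      # used only for the final evaluation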
Train, Validation and Test Sets
Historical context and motivation for deep learning
• Deep learning is inspired by the brain but not all of the brain’s
details are relevant.
• For a comparison, Aeroplanes were inspired by birds. The
principle of flying is the same but the details are extremely different.
• The history of deep learning goes back to 1943 when Warren
McCulloch and Walter Pitts created a computer model based on
the neural networks of the human brain.
• They came up with the idea that neurons are threshold units with
on and off states. We could build a Boolean circuit by connecting
neurons with each other and conduct logical inference with neurons.
• The brain is basically a logical inference machine because
neurons are binary. Neurons compute a weighted sum of inputs and
compare that sum to its threshold.
• It turns on if it’s above the threshold and turns off if it’s below, which
is a simplified view of how neural networks work.
• In 1947, Donald Hebb had the idea that neurons in
the brain learn by modifying the strength of the
connections between neurons. This is called Hebbian
learning: if two neurons fire together, then the
connection between them strengthens; if they don't
fire together, then the connection weakens.
• In 1957, Frank Rosenblatt proposed the
Perceptron, which is a learning algorithm that
modifies the weights of very simple neural nets.
• Since the 1940s, deep learning has evolved steadily over
the years, with two significant breaks in its
development.
• The development of the basics of a continuous Back
Propagation Model is credited to Henry J. Kelley in
1960.
• Stuart Dreyfus came up with a simpler version based
only on the chain rule in 1962.
• The concept of back propagation existed in the early
1960s but did not become truly useful until 1985.
• Overall, this idea of trying to build intelligent machines by
simulating lots of neurons was born in the 1940s, took off in the 1950s, and
completely died in the late 1960s. The main reasons for the field dying
off in the 1960s are:
• The researchers used neurons that were binary. However, the way to
get backpropagation to work is to use activation functions that are
continuous. At that time, researchers didn't have the idea of
using continuous neurons, and they didn't think they could train with
gradients because binary neurons are not differentiable.
• With continuous neurons, one would have to multiply the activation
of a neuron by a weight to get a contribution to the weighted sum.
However, before 1980, the multiplication of two numbers,
especially floating-point numbers, was extremely slow. This
was another incentive to avoid using continuous neurons.
• Deep Learning took off again in 1985 with the emergence of
backpropagation.
• In 1995, the field died again and the machine learning community
abandoned the idea of neural nets.
• In the early 2010s, people started using neural nets in speech
recognition with huge performance improvements, and they later
became widely deployed commercially.
• In 2013, computer vision started to switch to neural nets.
• In 2016, the same transition occurred in natural language
processing.
• Soon, similar revolutions will occur in robotics, control, and many
other fields.
Machine Learning

• Machine learning is a subfield of computer
science that explores the study and
construction of algorithms that can learn
from and make predictions on data.

• Such algorithms operate by building a
model from example inputs in order to
make data-driven predictions or
decisions, rather than following strictly
static program instructions.
Training and Testing of Model
Gradient Descent Learning
Mean Squared Error
• Mean Squared Error (MSE) = SSE/n, where n
is the number of instances in the data set
– SSE means Sum of Squared Errors
– This normalizes the error for data sets of different sizes
– MSE is the average squared error per pattern
• Root Mean Squared Error (RMSE) is the
square root of the MSE
– This puts the error value back into the same units
as the features and can thus be more intuitive
– RMSE is the average distance (error) of the targets
from the outputs, in the same scale as the features
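A short numpy sketch of these error measures (the target and output values are made up for illustration):

    import numpy as np

    targets = np.array([3.0, -0.5, 2.0, 7.0])
    outputs = np.array([2.5,  0.0, 2.0, 8.0])

    sse = np.sum((targets - outputs) ** 2)   # sum of squared errors
    mse = sse / len(targets)                 # average squared error per pattern
    rmse = np.sqrt(mse)                      # back in the same units as the targets

    print(sse, mse, rmse)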
Gradient Descent Learning: Minimize (Maximize) the Objective Function

[Figure: Error landscape – the sum squared error SSE = Σ (tᵢ − zᵢ)² plotted against the weight values]
Delta Learning Rule
(Widrow-Hoff Rule)
• The goal is to decrease the overall error each time a weight is
changed
• The total sum squared error (SSE) is used as the objective function:
  E = Σ (tᵢ − zᵢ)²
• The delta learning rule is valid only for continuous activation
functions and in the supervised training mode
• The delta rule may be stated as: "the adjustment made to a
synaptic weight of a neuron is proportional to the product
of the error signal and the input signal of the synapse"
Delta Rule for Single Output Unit
• The delta rule changes the weight of the connection to
minimize the difference between the net input to the
output unit, i.e. y_in, and the target value t
• The delta rule is given as
  Δwᵢ = α (t − y_in) xᵢ
 - where x is the vector of activations of the input units
 - y_in is the net input to the output unit
 - t is the target value and α is the learning rate
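A one-step numpy sketch of this update (the inputs, weights, target and learning rate are illustrative values):

    import numpy as np

    x = np.array([1.0, 0.5, -0.2])   # activations of the input units
    w = np.array([0.1, 0.1, 0.1])    # weights to the single output unit
    t, alpha = 1.0, 0.1              # target value and learning rate

    y_in = np.dot(w, x)              # net input to the output unit
    w = w + alpha * (t - y_in) * x   # delta rule: proportional to error * input

    print(y_in, w)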
Difference between Perceptron and Delta
(Widrow-Hoff) Learning Rule
• The Widrow-Hoff rule is very similar to the perceptron learning rule,
but their origins are different.
• The perceptron learning rule originates from the Hebbian
assumption, while the delta rule is derived from the gradient-descent
method.
• The perceptron learning rule stops after a finite number of
learning steps, but the gradient-descent approach continues
forever, converging only asymptotically to the solution.
• The delta rule updates the weights of the connections so as
to minimize the difference between the net input to the output unit
and the target value.
Back Propagation Network (BPN)
• Back propagation is a multi-layered feed-forward network
• It has a minimum of three layers:
(i) input layer (ii) hidden layer (iii) output layer
• In this network, information propagation is only in the forward
direction and there is no feedback loop
• It does not have any feedback connection
• Only errors are back-propagated during the training
• The name back-propagation derives from the fact that
computations are passed forward from the input to the output layer,
following which the calculated errors are propagated back in the other
direction to change the weights and obtain better performance
• BPN is the most widely used model in terms of
practical applications; about 90% of
commercial and industrial applications use this
model
• It was introduced in 1986 by D. E. Rumelhart,
G. E. Hinton and R. J. Williams
• The network uses an extended gradient-descent-
based delta-learning rule, commonly known as
the back propagation learning rule
Generalized Delta Learning Rule or Back
Propagation Learning Rule
• The total squared error of the output
computed by the net is minimized by the
gradient descent method, known as the
generalized delta learning rule or back
propagation learning rule
• The mean squared error for a particular training
pattern is
  E = ½ Σₖ (tₖ − yₖ)²
• Here E is a cost function that quantifies the difference
between targets and outputs
• Squaring makes the error positive and penalizes large
errors more
• The ½ makes the maths easier
• The gradient of E is a vector consisting of the
partial derivatives of E with respect to weight.
Training Algorithm of BPN
• Training algorithm of BPN involves four stages
- Initialization of weights
- Feed Forward
- Back propagation of Error
- Updation of weight and bias
(1) During the first stage, i.e. initialization of weights,
some small random values are assigned.
(2) During the second stage, i.e. the feed-forward stage,
each input unit receives an input signal (xᵢ) and
transmits this signal to the hidden units (z₁ … z_p)
- Each hidden unit applies its activation function and
sends its signal to the output units
- Each output unit applies its activation function to
form the response of the net for the given input pattern
Three-layer back-propagation neural network
[Figure: input signals x₁ … xₙ enter the input layer, pass through weights wᵢⱼ to the hidden layer and through weights wⱼₖ to the output layer y₁ … yₗ; error signals flow back in the opposite direction]
(3) During the third stage, i.e. back propagation of error, each
output unit compares its computed activation yₖ with the target
value tₖ to determine the associated error for that pattern with
that unit
- based on this error, the factor δₖ (k = 1, 2, …, m) is computed
and is used to distribute the error at output unit yₖ back to
all units in the previous layer
- similarly, the factor δⱼ (j = 1, 2, …, p) is computed for each hidden
unit zⱼ
(4) During the fourth stage, the weights and biases are updated
1. Initialization of Weights:
Step 1: Initialize the weights to small random values
2. Feed Forward:
Step 2: Each input unit receives the input signal xᵢ and
transmits this signal to all units in the hidden layer
Step 3: Each hidden unit (zⱼ, j = 1, 2, …, p) sums its weighted
input signals and applies the activation function to compute
its output signal
Step 4: Each output unit (yₖ, k = 1, 2, …, m) sums its weighted
input signals and applies the activation function to calculate
the output signal
  yₖ = f(y_ink)
Back Propagation of Error
• Step 5: Each output unit (yₖ, k = 1, 2, …, m)
receives a target value corresponding to the input
pattern, and its error information term δₖ is calculated
• On the basis of this calculated error correction,
the weights and biases are updated
• Step 6: Each hidden unit (zⱼ, j = 1, 2, …, p) sums
its delta inputs from the units in the output layer
• This term is multiplied by the derivative of
f(z_inj) to calculate the error signal δⱼ
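A compact numpy sketch of one training step covering these four stages (the network sizes, data, learning rate and sigmoid activation are illustrative choices; biases are omitted for brevity):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.array([0.5, -0.1, 0.3])          # input pattern
    t = np.array([1.0, 0.0])                # target pattern
    alpha = 0.5                             # learning rate

    def f(a):                               # sigmoid activation
        return 1.0 / (1.0 + np.exp(-a))

    # Stage 1: initialize weights with small random values
    V = rng.normal(scale=0.1, size=(3, 4))  # input -> hidden weights
    W = rng.normal(scale=0.1, size=(4, 2))  # hidden -> output weights

    # Stage 2: feed forward
    z = f(x @ V)                            # hidden activations
    y = f(z @ W)                            # output activations

    # Stage 3: back-propagate the error
    delta_k = (t - y) * y * (1 - y)         # error term at the output units
    delta_j = (delta_k @ W.T) * z * (1 - z) # error term at the hidden units

    # Stage 4: update the weights
    W += alpha * np.outer(z, delta_k)
    V += alpha * np.outer(x, delta_j)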
Limitations of the Backpropagation algorithm:

• It is slow: all previous layers are locked until the gradients for the
current layer are calculated
• It suffers from the vanishing and exploding gradients problem
• It suffers from the overfitting and underfitting problems
• It considers only the predicted value and the actual value to calculate
the error and the gradients (related to the objective
function, partially related to the backpropagation algorithm)
• It doesn't consider the spatial, associative and dis-associative
relationships between classes while calculating errors (related to the
objective function, partially related to the backpropagation algorithm)
• The network may get trapped in a local minimum even though
there is a much deeper minimum nearby
Back Propagation Neural Network

W(new) = W(old) + ΔW (change in weight)
Vanishing and Exploding Gradient
Vanishing Gradient and Exploding Gradient Problems are difficulties found in
training certain Artificial Neural Networks with gradient based methods like
Back Propagation
How does Gradient Descent Algorithm Work
Vanishing Gradient Problem
Exploding Gradient Problem
Optimizers in Deep Neural Network
What is Function Optimization

● Optimization = minimizing or maximizing
● Maximizing a function f may be accomplished via minimizing −f
● f is called an objective function
● In the case of minimization, f is also called a cost function,
loss function, or error function
Optimizers in Deep Neural Network

• Optimizers are algorithms or methods used to
minimize an error function (loss function) or to
maximize the efficiency of production.
• Optimizers are mathematical functions which
depend on the model's learnable
parameters, i.e. weights and biases.
• Optimizers tell us how to change the
weights and learning rate of the neural network
in order to reduce the losses.
Gradient Descent in Machine Learning and Deep Learning

• Gradient descent is a popular optimization technique in
machine learning and deep learning.
• A gradient is the slope of a function, and gradient descent (a
movement down to a lower place) means descending a slope to
reach the lowest point on that surface.
• The gradient measures the degree of change of a variable in response to
the changes of another variable.
• Mathematically, gradient descent works with the partial derivatives of
the loss with respect to its parameters; the greater the gradient, the
steeper (bigger) the slope.
• Gradient descent iteratively reduces a loss function by moving
in the direction opposite to the gradient, i.e. opposite to the
direction of steepest ascent.
• It depends on the derivatives of the loss function for finding
the minima.
Gradient Descent Optimization

• Gradient descent is an iterative optimization
algorithm, used to find the minimum value of
a function.
• The general idea is to initialize the parameters
to random values, and then take small steps in
the direction of the "slope" at each iteration.
• Gradient descent is widely used in supervised
learning to minimize the error function and find
the optimal values for the parameters.
Gradient Descent Optimization

[Figure: loss plotted against the value of a weight – from the starting point, gradient descent moves downhill to the point of convergence, i.e. where the cost function is at its minimum level]
Gradient Descent
• Gradient descent is a way to minimize an objective
function 𝐽(𝜃)
• 𝐽(𝜃): objective function
• 𝜃 ∈ ℝᵈ: model's parameters
• 𝜂: learning rate – this determines the size of the steps we
take to reach a (local) minimum
• 𝛻θ𝐽(𝜃): gradient of the objective function with respect to the parameters

Update equation:
  𝜃(new) = 𝜃 − 𝜂 · 𝛻θ𝐽(𝜃)

[Figure: 𝐽(𝜃) plotted against 𝜃, with the local minimum at 𝜃*]
Change in Weight

  θ(new) = θ(old) − α·∇J(θ)
Advantages and Disadvantages of Gradient
Descent
• Advantages of Gradient Descent
-Easy computation
-Easy to understand
-Easy to implement
• Disadvantages of Gradient Descent
-May trap at local minima
-Weights are changed only after calculating the gradient on the whole
dataset. So, if the dataset is too large, it may take a very long
time to converge to the minima.
– Requires large memory to calculate gradient on the whole dataset.
Gradient Descent Variants

• There are three variants of gradient descent:
 - Batch gradient descent
 - Stochastic gradient descent
 - Mini-batch gradient descent
• The difference between these algorithms is the amount of data
used to compute the gradient for each update.

Update equation (the gradient term is computed differently in each method):
  𝜃 = 𝜃 − 𝜂 · 𝛻θ𝐽(𝜃)
Gradient Descent Algorithm

Solution: θ := θ − η·∇θJ(θ)

1. Initialize the step size η
2. Start with a random point θ
3. Calculate the gradient ∇θJ(θ) at point θ
4. Follow the reversed direction of the gradient → get a new θ
5. Repeat until reaching the minimum
   a. Stop condition: the gradient is small enough
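A minimal gradient-descent loop following these steps (the quadratic objective J(θ) = (θ − 3)² is an illustrative choice):

    import numpy as np

    def grad_J(theta):
        return 2.0 * (theta - 3.0)             # dJ/dtheta for J(theta) = (theta - 3)^2

    eta = 0.1                                  # 1. step size (learning rate)
    theta = np.random.default_rng(0).normal()  # 2. random starting point

    for step in range(1000):
        g = grad_J(theta)                      # 3. gradient at the current point
        theta = theta - eta * g                # 4. move against the gradient
        if abs(g) < 1e-6:                      # 5a. stop when the gradient is small enough
            break

    print(theta)                               # converges close to the minimum at theta = 3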
Stochastic Gradient Descent Learning Algorithm

• Stochastic gradient descent is an optimization
algorithm used in machine learning
applications to find the model parameters
that correspond to the best fit between
predicted and actual outputs.
• Stochastic gradient descent is widely used in
machine learning applications.
Stochastic Gradient Descent

• The word 'stochastic' means a system or a process
that is linked with a random probability.
• In stochastic gradient descent, a few samples are
selected randomly instead of the whole data set
for each iteration.
• In gradient descent, the whole dataset is used for
calculating the gradient in each iteration.
• Using the whole dataset is really useful for getting to the
minima in a less noisy and less random manner,
but the problem arises when our datasets get big.
• Suppose we have a million samples in the dataset; the gradient
descent optimization technique uses all of the one million
samples to complete one iteration, and this has to be done
for every iteration until the minima is reached.
• Hence, it becomes computationally very expensive to
perform.
• This problem is solved by stochastic gradient descent.
• SGD uses only a single sample, i.e. a batch size of one,
to perform each iteration.
• The sample is randomly shuffled and selected for
performing the iteration.
Stochastic Gradient Descent

• This algorithm processes one training sample in every
iteration.
• The parameters get updated after every iteration, since
only one data sample is worked on in every iteration.
• It is quicker in comparison to batch gradient descent.
• The overhead is high if the number of training samples
in the dataset is large, because the number of
iterations would be high and the amount of time taken
would also be high.
• Algorithm: θ(new) = θ(old) − α·∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾),
where {x⁽ⁱ⁾, y⁽ⁱ⁾} are the training examples.
Stochastic gradient descent
• This method performs a parameter update for each training
example 𝑥⁽ⁱ⁾ and label 𝑦⁽ⁱ⁾.
• Update equation: 𝜃 = 𝜃 − 𝜂 · 𝛻θ𝐽(𝜃; 𝑥⁽ⁱ⁾, 𝑦⁽ⁱ⁾)
• (In batch gradient descent, by contrast, we would need to calculate
the gradients for the whole dataset to perform just one update.)
• Note: we shuffle the training data at every epoch; a sketch of the
update loop follows.
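A minimal SGD loop in Python (the data and the one-parameter linear model y ≈ θ·x with squared-error loss are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=200)
    ys = 2.0 * xs + rng.normal(scale=0.1, size=200)      # synthetic data, true slope 2.0

    theta, eta = 0.0, 0.05
    for epoch in range(20):
        for i in rng.permutation(len(xs)):               # shuffle at every epoch
            grad = 2.0 * (theta * xs[i] - ys[i]) * xs[i] # gradient of (theta*x - y)^2
            theta -= eta * grad                          # update after every single example

    print(theta)                                         # close to 2.0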
• Advantages of SGD:
-Frequent updates of model parameters, hence it
converges in less time.
-Requires less memory, as there is no need to store
values of loss functions.
-May find new minima.
• Disadvantages:
-High variance in the model parameters.
-May overshoot even after achieving the global minima.
Stochastic Gradient Descent With Momentum

• Momentum is a very popular optimization technique that is
used along with SGD.
• Momentum is a hyper-parameter symbolized by gamma 'γ'.
• It is used for reducing the high variance in SGD and softens the
convergence.
• Instead of using only the gradient of the current step to guide
the search, momentum also accumulates the gradients of the
past steps to determine the direction to go.
• It accelerates the convergence towards the relevant direction
and reduces the fluctuation in irrelevant directions.
• Algorithm:
  V(t) = γ·V(t−1) + α·∇J(θ)
• The weights are updated by θ(new) = θ(old) − V(t).
• The value of the momentum term γ lies
between 0 ≤ γ ≤ 1.
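A minimal sketch of this update (reusing the illustrative one-parameter setting from the SGD sketch above):

    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(-1, 1, size=200)
    ys = 2.0 * xs + rng.normal(scale=0.1, size=200)

    theta, eta, gamma, v = 0.0, 0.05, 0.9, 0.0
    for epoch in range(20):
        for i in rng.permutation(len(xs)):
            grad = 2.0 * (theta * xs[i] - ys[i]) * xs[i]
            v = gamma * v + eta * grad   # V(t) = gamma*V(t-1) + alpha*grad
            theta -= v                   # theta(new) = theta(old) - V(t)

    print(theta)                         # close to 2.0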
• Advantages:
• Reduces the oscillations and high variance of
the parameters.
• Converges faster than gradient descent.
• Disadvantages:
• One more hyper-parameter is added which
needs to be selected manually and accurately.
Stochastic Gradient Descent With Momentum
• Momentum smooths out the noisy updates by taking an
exponentially weighted average of the past values.
• Now suppose at time steps t1, t2, t3, …, tn the values are
b1, b2, b3, …, bn.
• We create a variable v (taking γ = 0.5 for illustration):
  vt1 = b1
  vt2 = γ·vt1 + b2 = 0.5·b1 + b2
  vt3 = γ·vt2 + b3 = γ(γ·vt1 + b2) + b3 = γ²·vt1 + γ·b2 + b3 = 0.25·b1 + 0.5·b2 + b3
  and so on
Difference between Gradient Descent and Stochastic Gradient Descent Algorithms

    Gradient Descent                              | Stochastic Gradient Descent
1.  Uses the whole training sample data           | Uses a single training sample per update
2.  Slow and computationally expensive            | Faster and less computationally expensive than batch GD
3.  Not suggested for huge training samples       | Can be used for large training samples
4.  Deterministic in nature                       | Stochastic in nature
5.  Gives the optimal solution                    | Gives a good solution, but not the optimal one
6.  No random shuffling of points is required     | The data samples should be in a random order, which is why we shuffle the training set for every epoch
7.  Convergence is slow                           | Reaches convergence much faster
8.  Can't escape shallow local minima easily      | Can escape shallow local minima more easily
Comparison of trade-offs of Gradient Descent Variants

Method                       | Accuracy              | Update Speed | Memory Usage | Online Learning
Batch gradient descent       | Good                  | Slow         | High         | No
Stochastic gradient descent  | Good (with annealing) | High         | Low          | Yes
Mini-batch gradient descent  | Good                  | Medium       | Medium       | Yes

Table: Comparison of trade-offs of gradient descent variants
Learning Rate
• Learning rate is probably the most important aspect of gradient descent and
also other optimizers as well.
• Example: Imagine the cost function as a pit.
• We will be starting from the top and your objective is to get to the bottom of
the pit.
• We can think of learning rate as the step that we are going to take to reach
the bottom(global minima) of the pit.
• If we choose a large value as learning rate, we would be making drastic
changes to the weights and bias values, i.e we would be taking huge jumps to
reach the bottom.
• There is also a huge probability that we will overshoot the global
minima(bottom) and end up on the other side of the pit instead of the
bottom.
• With a large learning rate, we will never be able to converge to the global
minima
Learning Rate
• If we choose a small value as the learning rate, we lower
the risk of overshooting the minima, but our algorithm
will take a longer time to converge.
• Hence, we would have to train for a longer period of
time.
• Also, if the cost function is non-convex, our algorithm
might be easily trapped in a local minima and it will be
unable to get out and converge to the global minima.
• There is no generic right value for learning rate. It
comes down to experimentation and intuition.
First-order optimization algorithms

• First-order methods use the first derivatives of
the function to minimize the loss function:
-Momentum
-Adagrad
-Adadelta
-RMSprop
-Adam
-Nesterov accelerated gradient (NAG)
Second-order optimization algorithms

• Second-order methods make use of an estimate of
the Hessian matrix (the matrix of second derivatives
of the loss function with respect to its parameters):
-Newton method
-Conjugate gradient
-Quasi-Newton method
-Levenberg-Marquardt algorithm.
Sparse and Dense Data

• Sparse data: most of the feature values are zero or missing (gaps in the data).
• Dense data: most of the feature values are present and non-zero.
Adaptive Gradient (Adagrad) Optimizer

• In GD, SGD and mini-batch SGD, the weights change but the
learning rate remains the same for every parameter and every iteration.
• Idea of the adaptive gradient: Adagrad uses a different learning rate
for each parameter at each iteration.
• [Figure: the Adagrad update divides the learning rate by the accumulated
squared gradients of the loss function, with a small ε added to avoid
division by zero]
Adaptive Learning Rate
● The previous algorithms use a fixed learning rate
throughout the learning process
○ The learning rate has to be either set very small at
the beginning or periodically decreased during training

● Adaptive learning rate: the learning rate is automatically
decreased during the learning process

● Adaptive learning rate algorithms include:
-AdaGrad
-Adadelta
-RMSprop
-Adam
Motivation for Adaptive Learning Rate

• Consider a simple perceptron network with sigmoid activation.
• For a given single point (x, y), the gradients of the weights w
depend on the corresponding input values.
Motivation for Adaptive Learning Rate

• The gradient of f(x) w.r.t. a particular weight clearly
depends on the input values.
• If there are n points, we can just sum the gradients over all
the n points to get the total gradient.
• But what would happen if the feature x₂ is very sparse (i.e.,
its value is 0 for most inputs)?
• It is fair to assume that ∇w₂ will be 0 for most points, and hence
w₂ will not get enough updates.
• To make sure updates happen even when a particular input is
sparse, can we have a different learning rate for each
parameter which takes care of the frequency of the features?
Adaptive Gradient Algorithm (Adagrad)

• The Adaptive Gradient algorithm, or AdaGrad for short,
is an extension of the stochastic gradient descent
optimization algorithm.
• A limitation of stochastic gradient descent is that it
uses the same step size (learning rate) for each input
variable.
• Previous methods: the same learning rate η for all
parameters θ.
• Adagrad [Duchi et al., 2011] adapts the learning rate to
the parameters (large updates for infrequent
parameters, small updates for frequent parameters).
Adaptive Gradient Algorithm

• AdaGrad is a variation of the stochastic gradient
optimization algorithms that updates the
learning rate for each parameter.
• Instead of a single universal learning rate as in
the case of stochastic gradient descent, AdaGrad
maintains a per-parameter learning rate,
which considerably improves performance on
problems with sparse gradients, such as natural
language or computer vision problems.
Adaptive Gradient Algorithm

• Previous methods: we used the same learning rate for all parameters 𝜽.
• Adagrad: it uses a different learning rate for every parameter 𝜃ᵢ at
every time step 𝑡.
Adaptive Gradient Algorithm (Adagrad)

SGD update for a single parameter:
  θ(t+1, i) = θ(t, i) − η · g(t, i)
Adagrad update:
  θ(t+1, i) = θ(t, i) − ( η / √(G(t, ii) + ε) ) · g(t, i)

• G_t ∈ ℝ^(d×d) is a diagonal matrix where each diagonal
element (i, i) is the sum of the squares of the gradients of 𝜃ᵢ
up to time step t.
• ε is a smoothing term that avoids division by zero.

Vectorized form:
  θ(t+1) = θ(t) − ( η / √(G_t + ε) ) ⊙ g_t
• Adagrad divides the learning rate by the square root of the
sum of squares of past gradients.
• Adagrad thus modifies the general learning rate 𝜼 at each time step
for every parameter 𝜽ᵢ, based on the past gradients that have been
computed for 𝜽ᵢ.
Advantages and Disadvantages of Adagrad

• Advantages :
• It is well-suited for dealing with sparse data (missing or
gaps in the data).
• It greatly improved the robustness of SGD.
• It eliminates the need to manually tune the learning rate.
• Disadvantage:
• Its main weakness is the accumulation of the squared
gradients in the denominator: the accumulated sum keeps growing
during training, so the learning rate keeps shrinking and
eventually becomes infinitesimally small.
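A per-parameter Adagrad sketch (the two-parameter quadratic objective J(θ) = θ₁² + 10·θ₂² and the step size are illustrative):

    import numpy as np

    def grad(theta):
        return np.array([2.0 * theta[0], 20.0 * theta[1]])   # gradient of theta_1^2 + 10*theta_2^2

    theta = np.array([5.0, 5.0])
    eta, eps = 1.0, 1e-8
    G = np.zeros(2)                          # accumulated squared gradients, one entry per parameter

    for t in range(500):
        g = grad(theta)
        G += g ** 2                          # the accumulation only ever grows
        theta -= eta / np.sqrt(G + eps) * g  # per-parameter effective learning rate

    print(theta)                             # moves toward the minimum at (0, 0)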
Adadelta
• Instead of inefficiently storing all past squared gradients, Adadelta
recursively defines the sum as a decaying average of all past
squared gradients.
• We define the running average of squared gradients E[g²]ₜ at time t as:
  E[g²]ₜ = γ·E[g²]ₜ₋₁ + (1 − γ)·gₜ²
• E[g²]ₜ: the running average at time step t
• γ: a fraction, similar to the momentum term, around 0.9
• gₜ is the gradient of the objective at time step t
Adadelta
• In the Adagrad update, we replace the diagonal matrix Gₜ with the
decaying average over past squared gradients E[g²]ₜ:
  θ(t+1) = θ(t) − ( η / √(E[g²]ₜ + ε) ) · gₜ
Adadelta
• Advantages:
-The learning rate does not decay and the
training does not stop.
• Disadvantages:
-Computationally expensive.
Root Mean Square Propagation (RMSprop)
• The Root Mean Square Propagation RMS Prop is
similar to Momentum,
• It is a technique to reduce the motion in the y-axis
and speed up gradient descent.
• For better understanding, let us denote the Y-axis
as the bias b and the X-axis as the weight W.
• It is called Root Mean Square because we square
the derivatives of both w and b parameters.
• RMSprop is a gradient-based optimization
technique used in training neural networks.
• It was proposed by Geoffrey Hinton, one of the
pioneers of back-propagation.
• Gradients of very complex functions like
neural networks have a tendency to either
vanish or accelerate as the data propagates
through the function.
Root Mean Square Propagation (RMSprop)
• AdaGrad can result in a premature and
excessive decrease in the learning rate
• RMSProp modifies AdaGrad to perform better
on non-convex surfaces
• It changes the gradient accumulation into an
exponentially decaying average of the sum of
squares of the gradients
Root Mean Square Propagation (RMSprop)

• The running average of squared gradients E[g²]ₜ at time t is,
as suggested by Hinton (with γ = 0.9 and 𝜂 = 0.001):
  E[g²]ₜ = γ·E[g²]ₜ₋₁ + (1 − γ)·gₜ²
• The weight update in RMSprop is then given as:
  θ(t+1) = θ(t) − ( η / √(E[g²]ₜ + ε) ) · gₜ
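A minimal RMSprop sketch (same illustrative quadratic objective as the Adagrad sketch; the step size here is chosen for the toy problem rather than Hinton's 0.001):

    import numpy as np

    def grad(theta):
        return np.array([2.0 * theta[0], 20.0 * theta[1]])   # gradient of theta_1^2 + 10*theta_2^2

    theta = np.array([5.0, 5.0])
    eta, gamma, eps = 0.01, 0.9, 1e-8
    Eg2 = np.zeros(2)                              # running average of squared gradients

    for t in range(2000):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2   # decaying average, not a growing sum
        theta -= eta / np.sqrt(Eg2 + eps) * g

    print(theta)                                   # moves toward the minimum at (0, 0)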
Adam (Adaptive Moment Estimation)

• Adam is a replacement optimization
algorithm for stochastic gradient descent for
training deep learning models.
• Adam combines the best properties of the
AdaGrad and RMSProp algorithms to provide
an optimization algorithm that can handle
sparse gradients on noisy problems.
Adam Optimizer (Adaptive Moment Estimation)
• Adam was presented by Diederik Kingma and Jimmy Ba in 2015.
• Adam combines the advantages of two extensions of
stochastic gradient descent, specifically:
• Adaptive Gradient Algorithm (AdaGrad) that maintains a
per-parameter learning rate that improves performance on
problems with sparse gradients (e.g. natural language and
computer vision problems).
• Root Mean Square Propagation (RMSProp) that also
maintains per-parameter learning rates that are adapted based
on the average of recent magnitudes of the gradients for the
weight (e.g. how quickly it is changing).
Adam (Adaptive Moment Estimation)

• The intuition behind Adam is that we don't want to roll so fast
just because we can jump over the minimum; we want to decrease
the velocity a little bit for a careful search.
• In addition to storing an exponentially decaying average of past
squared gradients, like AdaDelta, Adam also keeps an exponentially
decaying average of past gradients m(t).
• m(t) and v(t) are estimates of the first moment (the mean) and the
second moment (the uncentered variance) of the gradients, respectively.

• The moment estimates are computed as:
  m(t) = β₁·m(t−1) + (1 − β₁)·g(t)
  v(t) = β₂·v(t−1) + (1 − β₂)·g(t)²
• Here, we take bias-corrected versions of m(t) and v(t) so that E[m̂(t)]
equals E[g(t)], where E[f(x)] is the expected value of f(x):
  m̂(t) = m(t) / (1 − β₁ᵗ),  v̂(t) = v(t) / (1 − β₂ᵗ)
• To update the parameters:
  θ(t+1) = θ(t) − η · m̂(t) / ( √v̂(t) + ε )
• Typical values are β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸.
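A minimal Adam sketch following these formulas (same illustrative quadratic objective as above):

    import numpy as np

    def grad(theta):
        return np.array([2.0 * theta[0], 20.0 * theta[1]])   # gradient of theta_1^2 + 10*theta_2^2

    theta = np.array([5.0, 5.0])
    eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
    m = np.zeros(2)                                  # first moment (mean of gradients)
    v = np.zeros(2)                                  # second moment (uncentered variance)

    for t in range(1, 2001):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)                 # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

    print(theta)                                     # moves toward the minimum at (0, 0)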
Advantages of Adam
• Straightforward to implement.
• Computationally efficient.
• Little memory requirements.
• Well suited for problems that are large in terms of data and/or
parameters.
• Appropriate for non-stationary objectives.
• Appropriate for problems with very noisy/or sparse gradients.
• Hyper-parameters have intuitive interpretation and typically
require little tuning.
Nesterov Accelerated Gradient (NAG)

θ(new) = θ(old) − v(t)
Nesterov Accelerated Gradient

• Momentum is a good method, but if the momentum is too high
the algorithm may miss the local minima and may continue to
rise up.
• To resolve this issue, the NAG algorithm was developed.
• We know we'll be using γ·V(t−1) to modify the weights, so
θ − γ·V(t−1) approximately tells us the future location.
• Now, we calculate the cost based on this future parameter
rather than the current one.
• V(t) = γ·V(t−1) + α·∇J( θ − γ·V(t−1) ), and then we update the
parameters using θ(new) = θ(old) − V(t).
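A minimal NAG sketch with the look-ahead gradient (same illustrative quadratic objective as above):

    import numpy as np

    def grad(theta):
        return np.array([2.0 * theta[0], 20.0 * theta[1]])   # gradient of theta_1^2 + 10*theta_2^2

    theta = np.array([5.0, 5.0])
    alpha, gamma = 0.01, 0.9
    v = np.zeros(2)

    for t in range(500):
        lookahead = theta - gamma * v            # approximate future position
        v = gamma * v + alpha * grad(lookahead)  # V(t) = gamma*V(t-1) + alpha*grad(theta - gamma*V(t-1))
        theta -= v                               # theta(new) = theta(old) - V(t)

    print(theta)                                 # moves toward the minimum at (0, 0)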
• Advantages:
• Does not miss the local minima.
• Slows down as it approaches the minima.
• Disadvantages:
• Still, the hyper-parameter needs to be
selected manually.
Saddle Point Problem
• At a saddle point the derivative is zero, so the weights are not updated.
• A saddle point is neither a minimum point nor a maximum point.
• The weights get stuck at the saddle point.
• This problem exists in non-convex functions.
• In three dimensions, consider f(x, y) = x² − y²
  df/dx = 2x; equating to zero, 2x = 0, so x = 0 (saddle point)
  df/dy = −2y; equating to zero, −2y = 0, so y = 0
• At (0, 0), df/dx = 0 and df/dy = 0
• Along x, the point x = 0 is a local minimum, and along y, the point
y = 0 is a local maximum.
• We should not get stuck at this saddle point.
How to move away from a saddle point
• At the saddle point, if we take a long jump in the y direction, we can
decrease the function value.
• Plain gradient descent gets stuck, because both partial derivatives are zero:
  x(new) = x(old) − α·df/dx  (df/dx = 0, so x(new) = x(old))
  y(new) = y(old) − α·df/dy  (df/dy = 0, so y(new) = y(old))
• How to avoid this situation – add some value to the gradient:
  x(new) = x(old) − α·(df/dx + 20)  (adding some value)
  y(new) = y(old) − α·(df/dy + 20)  (adding some value in y, we take a long
  jump and decrease the y term; after applying gradient descent we move
  towards the global minimum)
