• Batch Normalization
• Normalization vs. Standardization
• Covariate Shift
Why Normalization
Bias Variance Tradeoff
Bias and variance typically trade off in relation to model complexity
The total error of the model is minimized by using the bias-variance tradeoff:
the best fit is given by the hypothesis at the tradeoff point, where the combined contribution of bias and variance is smallest.
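The tradeoff can be made concrete with a small experiment. The sketch below (illustrative, not from the slides) repeatedly fits polynomials of increasing degree to noisy samples of a known function and estimates bias² and variance of the fits: low-degree models show high bias, high-degree models show high variance, and an intermediate degree minimizes the total.

```python
# Illustrative sketch: estimating bias^2 and variance for models of
# increasing complexity on a toy 1-D regression problem.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
true_f = np.sin(2 * np.pi * x)          # ground-truth function

def fit_predict(degree):
    """Fit one polynomial model on a freshly sampled noisy dataset."""
    y = true_f + rng.normal(0, 0.3, size=x.shape)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x)

for degree in [1, 3, 9]:                 # low, moderate, high complexity
    preds = np.stack([fit_predict(degree) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={variance:.3f}, "
          f"total={bias2 + variance:.3f}")
```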
One possible strategy: take the average of the last several values.
This might work in certain cases, but it is not very suitable for scenarios where a parameter depends more on the most recent values.
A better option is the exponentially weighted average, which is based on the assumption that more recent values of a variable contribute more to the formation of the next value than earlier values.
• Taking β = 0.9 indicates that in approximately t = 10 iterations the weight of an observation decays to 1/e of the weight of the current observation.
• In other words, the exponentially weighted average mostly depends only on the last t = 10 observations.
In the equation for the exponential moving average, the observation made t steps ago is multiplied by a term βᵗ. Comparing both forms, and making the substitution β = 1 − x in the famous second remarkable limit lim_{x→0}(1 − x)^{1/x} = 1/e, gives β^{1/(1−β)} ≈ 1/e; for β = 0.9 this means the weights decay by a factor of about 1/e every 1/(1 − β) = 10 steps.
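A minimal sketch of the exponentially weighted average, assuming β = 0.9: it applies the recursive form vₜ = β·vₜ₋₁ + (1 − β)·θₜ and confirms that the weight of an observation 10 steps back has decayed to roughly 1/e.

```python
# Minimal sketch of an exponentially weighted (moving) average with beta = 0.9,
# so the effective window is roughly 1 / (1 - beta) = 10 observations.
import numpy as np

beta = 0.9
observations = np.random.default_rng(1).normal(size=100)

v = 0.0
ewa = []
for theta in observations:
    v = beta * v + (1 - beta) * theta   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    ewa.append(v)

print(beta ** 10)   # ~0.349, close to 1/e: weight of an observation 10 steps back
```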
• In this example, the starting point and the local minimum have different horizontal coordinates but almost equal vertical coordinates.
• Using gradient descent to find the local minimum will likely make the loss oscillate along the vertical axis while progressing only slowly in the horizontal direction.
• These bounces occur because gradient descent does not store any history of its previous gradients, making the gradient steps more erratic on each iteration.
• Thus, a large learning rate can cause divergence.
Setting the Learning Rate
Gradient Descent
Momentum
It would be desirable for the optimizer to take larger steps in the horizontal direction and smaller steps in the vertical direction.
Momentum uses a pair of equations at each iteration:
v = β·v + (1 − β)·dW   (exponentially moving average of the gradient values dW)
W = W − α·v   (normal gradient-descent update using the computed moving-average value on the current iteration)
The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
(An overview of gradient descent optimization algorithms, Sebastian Ruder)
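A minimal sketch of the momentum update described above (names and the toy loss are illustrative assumptions): v keeps an exponentially weighted average of the gradients dW, and W is updated with that average.

```python
# Minimal sketch of the momentum update on a toy quadratic loss ||W||^2.
import numpy as np

def momentum_step(W, dW, v, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * dW      # moving average of gradient values
    W = W - lr * v                      # gradient-descent update using the average
    return W, v

W = np.array([1.0, -2.0])
v = np.zeros_like(W)                    # v initialised to 0
for _ in range(500):
    dW = 2 * W                          # gradient of the toy loss ||W||^2
    W, v = momentum_step(W, dW, v)
print(W)                                # approaches the minimum at [0, 0]
```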
Momentum
Instead of simply using the raw gradients for updating the weights, we take several past values and literally perform the update in the averaged direction.
Momentum usually converges much faster than gradient descent. With Momentum there are also fewer risks in using larger learning rates, thus accelerating the training process.
(Figure: optimization with Momentum.)
In Momentum, it is recommended to choose β close to 0.9.
Nesterov Momentum
(Figure: the projected gradient step; v is initialised to 0.)
The intuition is that the standard momentum method first computes the
gradient at the current location and then takes a big jump in the direction of the
updated accumulated gradient. In contrast Nesterov momentum first makes a
big jump in the direction of the previous accumulated gradient and then
measures the gradient where it ends up and makes a correction. The idea being
that it is better to correct a mistake after you have made it.
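The sketch below contrasts the Nesterov "look-ahead" step with the classical update (a minimal illustration under the same toy-loss assumption as before, not a definitive implementation): the gradient is measured at the position reached after the previous accumulated step, and the correction is applied there.

```python
# Minimal sketch of Nesterov momentum: jump first, then measure the gradient.
import numpy as np

def grad(W):
    return 2 * W                        # gradient of a toy quadratic loss ||W||^2

W = np.array([1.0, -2.0])
v = np.zeros_like(W)                    # v initialised to 0
lr, beta = 0.01, 0.9
for _ in range(100):
    lookahead = W - lr * beta * v       # first make the "big jump"
    v = beta * v + grad(lookahead)      # then measure the gradient there (correction)
    W = W - lr * v
print(W)                                # close to the minimum at [0, 0]
```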
AdaGrad (white) vs. gradient descent (cyan) on a terrain with a saddle point. The learning rate of AdaGrad is set higher than that of gradient descent, but the point that AdaGrad's path is straighter stays largely true regardless of learning rate. This property allows AdaGrad (and other similar adaptive methods) to escape the saddle point much better.
AdaGrad (Adaptive Gradient Algorithm)
From the animation, it can be seen that AdaGrad might converge more slowly than other methods. This could be because the accumulated gradient in the denominator causes the learning rate to shrink and become very small, thereby slowing down the learning over time.
v = v + dW²   (the last squared gradient is accumulated at every iteration)
W = W − α·dW / (√v + ε)
• A small positive aspect of this algorithm is that only a single bit is required to store the signs of the gradients, which can be handy in distributed computations with strict memory requirements.
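A minimal sketch of the AdaGrad update on the same toy loss (illustrative values): the squared gradients are accumulated in the denominator, so the effective learning rate keeps shrinking over time.

```python
# Minimal sketch of AdaGrad: per-parameter learning rates that only ever shrink.
import numpy as np

def adagrad_step(W, dW, G, lr=0.5, eps=1e-8):
    G = G + dW ** 2                       # accumulated sum of squared gradients
    W = W - lr * dW / (np.sqrt(G) + eps)  # per-parameter (adaptive) learning rate
    return W, G

W = np.array([1.0, -2.0])
G = np.zeros_like(W)
for _ in range(200):
    dW = 2 * W
    W, G = adagrad_step(W, dW, G)
print(W)                                  # the steps shrink as G grows
```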
RMSProp (Root Mean Square Propagation)
RMSProp was developed as an improvement over AdaGrad that tackles the issue of learning-rate decay.
• Instead of storing a cumulative sum of squared gradients dW² in vₜ, an exponentially moving average of the squared gradients dW² is calculated.
RMSProp (green) vs AdaGrad (white). The first run just shows the balls; the second run also shows the
sum of gradient squared represented by the squares.
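A minimal sketch of the RMSProp update under the same toy-loss assumption: like AdaGrad, but the squared gradients are kept as an exponentially moving average, so the learning rate does not decay forever.

```python
# Minimal sketch of RMSProp: EWA of squared gradients in the denominator.
import numpy as np

def rmsprop_step(W, dW, v, lr=0.01, gamma=0.9, eps=1e-8):
    v = gamma * v + (1 - gamma) * dW ** 2   # EWA of squared gradients
    W = W - lr * dW / (np.sqrt(v) + eps)
    return W, v

W = np.array([1.0, -2.0])
v = np.zeros_like(W)
for _ in range(200):
    dW = 2 * W
    W, v = rmsprop_step(W, dW, v)
print(W)                                    # near the minimum at [0, 0]
```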
Adam (Adaptive Moment Estimation)
• Adam is the most famous optimization algorithm in deep learning.
• Adam combines the Momentum and RMSProp algorithms. To achieve this, it simply keeps track of the exponentially moving averages of the computed gradients and of the squared gradients, respectively.
• Furthermore, it is possible to use bias correction for the moving averages for a more precise approximation of the gradient trend during the first several iterations.
• Experiments show that Adam adapts well to almost any type of neural network architecture, taking advantage of both Momentum and RMSProp.
m = β₁·m + (1 − β₁)·dW   (first momentum)
v = β₂·v + (1 − β₂)·dW²   (second momentum)
W = W − α·m̂ / (√v̂ + ε)   (updated weight, using the bias-corrected m̂ and v̂)
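A minimal sketch of the Adam update under the same toy-loss assumption, combining the first moment (Momentum-style), the second moment (RMSProp-style), and the bias correction of both.

```python
# Minimal sketch of Adam: bias-corrected first and second moments.
import numpy as np

def adam_step(W, dW, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * dW            # first momentum: EWA of gradients
    v = b2 * v + (1 - b2) * dW ** 2       # second momentum: EWA of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)   # updated weight
    return W, m, v

W = np.array([1.0, -2.0])
m = np.zeros_like(W)
v = np.zeros_like(W)
for t in range(1, 201):
    dW = 2 * W
    W, m, v = adam_step(W, dW, m, v, t)
print(W)                                  # approaches the minimum at [0, 0]
```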
Adam (Adaptive Moment Estimation)
Disadvantage
It doesn't focus on the data points; rather, it focuses on computation time.
Note: the optimization algorithm can be picked accordingly, depending on the requirements and the type of data.
Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Summary: Optimizers
Remember: optimization through gradient descent
W ← W − η · ∂J(W)/∂W
(Figure: loss J(W) versus weight W, starting from an initial guess.)
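A minimal sketch of this basic update on a toy loss (an illustrative assumption, J(W) = W²): repeat W ← W − η·∂J/∂W from an initial guess.

```python
# Minimal sketch of plain gradient descent on J(W) = W^2.
eta = 0.1
W = 5.0                      # initial guess
for _ in range(50):
    grad = 2 * W             # dJ/dW for J(W) = W^2
    W = W - eta * grad
print(W)                     # converges towards the minimum at W = 0
```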
Setting the Learning Rate
• Large learning rates overshoot, become unstable, and diverge, which is undesirable.
Setting the Learning Rate
• Setting the learning rate is very challenging.
• Stable learning rates converge smoothly and avoid local minima.
Setting the Learning Rate
Idea 1:
Trial-and-error method: try different learning rates and see what works.
Idea 2:
Do something smarter!
Design an adaptive learning rate that "adapts" to the landscape.
v(w, t+1) = γ · v(w, t) + (1 − γ) · (∂J(ω)/∂ω)²
W(t+1) = W(t) − (α / (√v(w, t+1) + ε)) · ∂J(ω)/∂ω
Here γ is the momentum or forgetting factor, usually 0.9; α is a constant learning rate; and ε is a small positive term to avoid division by 0, so each weight effectively receives a different learning rate at each iteration.
RMSProp (Root Mean Square Propagation)
• Advantage: it avoids the monotonic decrease of the learning rate that occurs in AdaGrad.
• Since mₜ and vₜ are initialized to 0, they tend to be 'biased towards 0', especially as both β₁ and β₂ ≈ 1. Adam fixes this problem by computing 'bias-corrected' mₜ and vₜ. This controls the weights while reaching the global minimum, preventing high oscillations near it.
• The algorithm has a faster running time, low memory requirements, and requires less tuning than other optimization algorithms.
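A small numeric sketch of why the bias correction matters in the first iterations (assuming β₁ = 0.9 and a constant gradient of 1.0 purely for illustration): the raw moment m starts biased towards 0, while the corrected estimate recovers the true average immediately.

```python
# Bias correction of the first moment during the first few iterations.
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0       # raw EWA, biased towards 0 at the start
    m_hat = m / (1 - beta1 ** t)            # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))  # m: 0.1, 0.19, ...; m_hat: 1.0 each step
```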
Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Deep Learning
• Course Code:
• Unit 2: Model Parameter Optimization
• Lecture 3: Overfitting and Underfitting, Bias-Variance Trade-off
Source: https://www.javatpoint.com/overfitting-in-machine-learning
Techniques to Avoid Overfitting
• Data Augmentation
• Regularization
• Drop-out
• Early-stopping
• Cross validation
Source: https://www.javatpoint.com/overfitting-in-machine-learning
(Figure: a network with many neurons fitting the training set. Slide source: Coding Lane)
Regularization
Ridge (L2) regression adds a penalty to the usual cost:
Cost function = Sum of the squared residuals + λ · (slope)²
• For the linear regression line, consider two points that lie exactly on the line: the sum of the squared residuals is 0, but the steep slope gives a large penalty, so Cost function = 1.96.
• For the ridge regression line, assume λ = 1 and slope = 0.7: the penalty is only 0.49, and with the small residuals, Cost function = 0.63.
Here the cost function consists of the sum of the squared residuals, the penalty for the errors (λ · slope²), and the slope of the curve/line.
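A minimal sketch of the ridge cost computation; the data points and line parameters below are illustrative assumptions, not the exact values from the slide, but they show the same effect: a steep slope with zero residuals can still cost more than a flatter line with small residuals.

```python
# Minimal sketch of the ridge cost: sum of squared residuals + lambda * slope^2.
import numpy as np

def ridge_cost(x, y, slope, intercept, lam):
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2) + lam * slope ** 2

x = np.array([0.5, 1.0])    # illustrative data points
y = np.array([1.0, 1.7])
print(ridge_cost(x, y, slope=1.4, intercept=0.3, lam=1.0))  # 1.96: zero residuals, large penalty
print(ridge_cost(x, y, slope=0.7, intercept=0.6, lam=1.0))  # ~0.65: small residuals, small penalty
```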
Early Stopping
(Figure: training and testing loss versus training iterations; the under-fitting and over-fitting regions are marked, with the label "Stop training here!" where the testing loss starts to rise.)
• Stop training before we have a chance to overfit.
• The number of iterations (epochs) is a hyperparameter.
• Too few epochs ⇒ suboptimal solution (underfit).
• Too many epochs ⇒ overfitting.
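A minimal sketch of the early-stopping rule; the loss curve below is simulated (an assumption for illustration) so the snippet is self-contained: training stops once the testing loss has not improved for a fixed number of epochs.

```python
# Minimal sketch of early stopping on a simulated testing-loss curve that
# decreases until ~epoch 30 and then rises again (overfitting region).
patience = 5
best_loss, best_epoch, epochs_without_improvement = float("inf"), 0, 0

for epoch in range(100):
    val_loss = (epoch - 30) ** 2 / 1000 + 0.2       # toy U-shaped testing loss
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                   # stop training here!

print(f"stopped at epoch {epoch}, best epoch was {best_epoch}")
```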
Source: https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
Batch Normalization
If we stabilize the input values for each layer (defined as z = Wx + b, where z is the linear transformation of the weights/parameters W and the biases b), we can prevent our activation function from pushing the input values into the maximum/minimum regions of the activation function.
(Fig.) From the gradient it can be observed that for larger z the function approaches zero. When the network's nodes exist in this space, training slows down significantly, since the gradient values decrease.
• Normalize the hidden activations by subtracting the mean from each input and dividing by the sum of the standard deviation and the smoothing term ε:
μ = (1/n) Σᵢ zᵢ,   σ = √((1/n) Σᵢ (zᵢ − μ)²)   (n = no. of neurons at layer h)
z_norm(i) = (zᵢ − μ) / (σ + ε)
• γ (gamma) and β (beta): these parameters are used for re-scaling (γ) and shifting (β) of the vector containing the values from the previous operations: z̃ᵢ = γ · z_norm(i) + β.
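A minimal sketch of this normalization for a batch of hidden activations, following the description above (subtract the mean, divide by σ + ε, then re-scale with γ and shift with β); the batch shape and parameter values are illustrative.

```python
# Minimal sketch of batch normalization of the activations z of one layer.
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                  # mean over the mini-batch
    sigma = z.std(axis=0)                # standard deviation over the mini-batch
    z_norm = (z - mu) / (sigma + eps)    # normalized activations
    return gamma * z_norm + beta         # re-scaled and shifted output

z = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))   # batch of 32, 4 neurons
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))    # ~0 mean, ~1 std
```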