Deep learning

• Batch Normalization
Normalization vs. Standardization

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Why Normalization


Covariate Shift


Batch Normalization

Deep Neural Networks (Alternate Explanation: Bias-Variance Trade-off)
Bias
Bias: The difference between the values predicted by the machine learning model and the correct values.


High bias in the model (underfitting):
• Large error on the training data as well as the test data.
• The hypothesis is too simple or linear in nature.
• The predictions lie on a straight line and therefore do not fit the data in the data set accurately.


Variance
Variance: The variability of the model's prediction for a given data point, which tells us how spread out the predictions are.

High variance in the model (overfitting):
• The model fits the training data in a very complex way.
• It is not able to fit accurately on data which it has not seen before (test data).
• Such models perform very well on training data but have high error rates on test data.
Bias Variance Tradeoff
Bias and variance typically trade off in relation to model complexity



Bias Variance Tradeoff
• If the algorithm is too simple (hypothesis with a linear equation), it is in a high bias and low variance condition.
• If the algorithm fits too complex a model (hypothesis with a high-degree equation), it is in a high variance and low bias condition.


Bias Variance Tradeoff
An algorithm cannot be both more complex and less complex at the same time.

The Bias-Variance Tradeoff is used to minimize the total error of the model:
• The best fit is given by the hypothesis at the tradeoff point.
• This is the point chosen for training the algorithm, because it gives low error on the training data as well as the test data.


Bias Variance Tradeoff
1. High Bias and High Variance (The Worst-Case Scenario)
2. Low Bias and Low Variance (The Best-Case Scenario)
3. Low Bias and High Variance (Overfitting)
4. High Bias and Low Variance (Underfitting)
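
The underfitting and overfitting cases above can be illustrated with a small experiment. The following is a minimal sketch, not part of the original slides: it assumes a hypothetical data set (a noisy sine curve) and uses polynomial degree as a stand-in for model complexity. A degree-1 fit plays the role of the high-bias model, and a degree-9 fit plays the role of the more complex model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth (an assumption for this sketch): sine curve plus noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in errors.items():
    print(f"degree={degree}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The training error can only go down as the degree grows, while the test error shows whether the extra complexity actually helps. The straight line should show a large error on both sets, which is the high-bias signature described above.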

Exponential Moving Average
Task: Approximate a parameter that changes over time, where all of its previous values are known. The objective is to predict the next value, which depends on the previous values.

One possible strategy: Take the average of the last several values. This might work in certain cases, but it is not well suited to scenarios where a parameter depends more on the most recent values.

Second possible strategy: Assign higher weights to more recent values and lower weights to earlier values.

Exponential Moving Average

It is based on the assumption that more recent values of a variable contribute more to the formation of the next value than earlier values.


Exponential Moving Average

$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$

• vₜ is a time series that approximates a given variable. Its index t corresponds to the timestamp t.
• The value v₀ for the initial timestamp t = 0 is usually taken as 0.
• θ is the observation on the current iteration.
• β is a hyperparameter between 0 and 1 which defines how the weight is distributed between the previous average value vₜ₋₁ and the current observation θ.
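
The recursion described by these symbols can be sketched in a few lines. This is an illustrative sketch only: the function name `exponential_moving_average` and the sample series are assumptions, not part of the slides.

```python
def exponential_moving_average(observations, beta=0.9, v0=0.0):
    """Return the list of v_t values for a sequence of observations theta_t."""
    v = v0
    averages = []
    for theta in observations:
        # New average = beta * previous average + (1 - beta) * current observation.
        v = beta * v + (1.0 - beta) * theta
        averages.append(v)
    return averages

series = [10.0, 10.0, 10.0, 10.0]
ema = exponential_moving_average(series, beta=0.9)
print(ema)
```

With v₀ = 0 the first averages stay well below the observed value of 10, which is the start-up effect that bias correction addresses.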



Exponential Moving Average

Exponential moving average for the t-th timestamp (with v₀ = 0):

$v_t = (1 - \beta)\left(\theta_t + \beta\,\theta_{t-1} + \beta^2\,\theta_{t-2} + \dots + \beta^{t-1}\,\theta_1\right)$

• The most recent observation θₜ has a weight of 1, the second most recent has a weight of β, the third most recent has a weight of β², and so on.
• Since 0 < β < 1, the factor βᵏ decreases exponentially as k increases, so the older an observation is, the less important it is.
• Finally, every term of the sum is multiplied by (1 − β).

In practice, the value for β is usually chosen close to 0.9.


Exponential Moving Average
Mathematical Interpretation

The famous second wonderful limit:

$\lim_{x \to 0} (1 - x)^{1/x} = \frac{1}{e}$

By making the substitution β = 1 − x:

$\beta^{\frac{1}{1 - \beta}} \approx \frac{1}{e}$

In the equation for the exponential moving average, every observation value is multiplied by a term βᵗ. Comparing both forms gives:

$t \approx \frac{1}{1 - \beta}$

By using this equation, for a chosen value of β, we can compute the approximate number of timestamps t it takes for the weight terms to reach the value 1 / e ≈ 0.368.

• Taking β = 0.9 indicates that after approximately t = 10 iterations, the weight decays to 1 / e of the weight of the current observation.
• In other words, the exponential weighted average mostly depends only on the last t = 10 observations.
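
As a quick numerical check of this interpretation, the sketch below compares the exact solution of βᵗ = 1/e with the approximation 1/(1 − β). The helper name `effective_window` is an assumption made for illustration.

```python
import math

def effective_window(beta):
    # Exact t solving beta**t = 1/e, and the approximation 1 / (1 - beta).
    exact = -1.0 / math.log(beta)
    approx = 1.0 / (1.0 - beta)
    return exact, approx

for beta in (0.5, 0.9, 0.99):
    exact, approx = effective_window(beta)
    print(f"beta={beta}: exact t={exact:.2f}, 1/(1-beta)={approx:.2f}")

# Weight of an observation that is 10 steps old when beta = 0.9.
weight_after_10 = 0.9 ** 10
print(weight_after_10, 1.0 / math.e)
```

The approximation gets closer to the exact value as β approaches 1, which is the regime in which the limit applies.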



Exponential Moving Average
Bias correction
• The common problem with using an exponentially weighted average is that, in most problems, it cannot approximate the first values of the series well.
• This occurs due to the absence of a sufficient amount of data on the first iterations.

Case 1: v₀ = 0
Then the first several values will put a large weight on v₀, which is 0, whereas most of the points on the scatterplot are above 20. The result is an imprecise approximation.

Case 2: v₀ = value of the first observation θ₁
Though this approach works well in some situations, it is still not perfect, especially in cases when a given sequence is volatile, for example, if θ₂ differs too much from θ₁. It will also result in a poor approximation for volatile data.


Exponential Moving Average
Bias correction
• The solution is to use a technique called “bias correction”.
• Instead of simply using the computed values vₖ, they are divided by (1 − βᵏ). Assuming that β is chosen close to 0.9 to 1, this expression tends to be close to 0 for the first iterations, where k is small.
• Thus, instead of slowly accumulating the first several values from v₀ = 0, the values are now divided by a relatively small number, which scales them into larger values.
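
The effect of dividing by (1 − βᵏ) can be seen on a short series. The following is a minimal sketch under the assumption v₀ = 0; the function name and the sample observations are hypothetical.

```python
def ema_with_bias_correction(observations, beta=0.9):
    """Return (raw, corrected) exponential moving averages, starting from v0 = 0."""
    v = 0.0
    raw, corrected = [], []
    for k, theta in enumerate(observations, start=1):
        v = beta * v + (1.0 - beta) * theta
        raw.append(v)
        # Bias correction: divide by (1 - beta**k), which is small for small k.
        corrected.append(v / (1.0 - beta ** k))
    return raw, corrected

raw, corrected = ema_with_bias_correction([20.0, 22.0, 21.0, 23.0])
for k, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"k={k}: raw={r:.3f}  corrected={c:.3f}")
```

The raw values start near 2 even though every observation is above 20, while the corrected values start at the first observation. As k grows, (1 − βᵏ) approaches 1 and the two series coincide.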
Gradient Descent : Representation



Gradient descent
Gradient descent is the simplest optimization algorithm: it computes the gradients of the loss function with respect to the model weights and uses them to update the weights.

Gradient descent equation:

$w_t = w_{t-1} - \alpha\, dw_t$

• w is the weight vector,
• dw is the gradient of the loss with respect to w,
• α is the learning rate,
• t is the iteration number.
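
A minimal sketch of this update on a toy "ravine" follows. The loss f(w) = 0.5·(w₀² + 25·w₁²), the starting point, and the learning rate are assumptions chosen for illustration; they are not taken from the slides.

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2),
    # which is much steeper along w[1] than along w[0].
    return np.array([w[0], 25.0 * w[1]])

def gradient_descent(w0, alpha, steps):
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        w = w - alpha * grad(w)  # w_t = w_{t-1} - alpha * dw_t
        path.append(w.copy())
    return np.array(path)

path = gradient_descent([-10.0, 1.0], alpha=0.07, steps=50)
print(path[:4])   # the steep coordinate flips sign on every step
print(path[-1])   # the flat coordinate is still approaching 0
```

With this learning rate the steep coordinate overshoots and changes sign at each step, while the flat coordinate shrinks by only 7% per step, which mirrors the slow, bouncing progress described for ravine-shaped surfaces.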

Figure: Optimization problem with gradient descent in a ravine area. Blue: starting point. Black: local minimum area, where the surface is much steeper in one dimension than in the other. (Courtesy: towardsdatascience)
Gradient descent

• In this example, the starting point and the local minimum have different horizontal coordinates and almost equal vertical coordinates.
• Using gradient descent to find the local minimum will likely make the updates oscillate along the vertical axis while moving only slowly in the horizontal direction.
• These bounces occur because gradient descent does not store any history of its previous gradients, which makes the gradient steps less predictable from one iteration to the next.
Thus, a large learning rate can lead to divergence.


Why do we need better optimization algorithms?
• In practice, the gradient descent technique can run into certain problems during training that slow down the learning process or, in the worst case, even prevent the optimal weights from being found.
• These problems are, on the one hand, so-called saddle points and, on the other hand, local minima of the loss function. At saddle points and local minima, the loss function becomes flat and the gradient at this point goes towards zero.

Figure: a local minimum and a saddle point.


Gradient Descent

• A gradient close to zero at a saddle point or in a local minimum does not improve the weight parameters and stalls the whole learning process.
• In ravine-like regions, gradient descent results in a zig-zag motion towards the optimal weights, which can slow down learning a lot.

Figure: Gradient Descent
Momentum
It would be desirable for the optimizer to take larger steps in the horizontal direction and smaller steps in the vertical direction.

Momentum uses a pair of equations at each iteration:

$v_t = \beta v_{t-1} + (1 - \beta)\, dw_t$ (exponentially moving average of the gradient values dw)
$w_t = w_{t-1} - \alpha\, v_t$ (normal gradient descent update using the moving average value computed on the current iteration)

The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation (Sebastian Ruder, "An overview of gradient descent optimization algorithms").
Momentum
Instead of simply using the current gradients to update the weights, we take several past values and literally perform the update in the averaged direction.

Momentum usually converges much faster than gradient descent. With Momentum, there is also less risk in using larger learning rates, which accelerates the training process.

Figure: Optimization with Momentum

In Momentum, it is recommended to choose β close to 0.9.
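
The pair of Momentum updates (a moving average of the gradient, followed by a normal gradient step that uses this average) can be sketched as follows. The toy loss, starting point, and learning rate are assumptions made for this example only.

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical ravine loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
    return np.array([w[0], 25.0 * w[1]])

def momentum(w0, alpha=0.2, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)  # moving average of gradients, starts at 0
    for _ in range(steps):
        dw = grad(w)
        v = beta * v + (1.0 - beta) * dw  # exponentially moving average of dw
        w = w - alpha * v                 # gradient step using the averaged direction
    return w

w_final = momentum([-10.0, 1.0])
print(w_final)
```

In this sketch, plain gradient descent with the same α = 0.2 would multiply the steep coordinate by (1 − 0.2·25) = −4 on every step and diverge, whereas the averaged gradient keeps the update stable. This is one way to read the remark about larger learning rates being less risky with Momentum.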



Momentum
The Momentum technique provides an update rule that is motivated by a physical view of optimization. Imagine a ball in hilly terrain trying to reach the deepest valley. When the slope of the hill is very steep, the ball gains a lot of momentum and is able to pass over slight hills in its way. As the slope decreases, the momentum and speed of the ball decrease, and it eventually comes to rest in the deepest position of the valley.

Figure: Momentum (magenta) vs. Gradient Descent (cyan) on a surface with a global minimum (the left well) and a local minimum (the right well).
Momentum
• In general, velocity can be seen to increase with time. By using the momentum term, saddle points and local minima become less dangerous for gradient-based training. This is because the step size toward the global minimum now depends not only on the slope of the loss function at the current point, but also on the velocity that has built up over time.
• The advantage of momentum is that it makes a very small change to SGD but provides a big boost to the speed of learning. We need to store the velocity for all the parameters and use this velocity for making the updates.

Figure: SGD (black) vs. SGD with momentum (blue).
Nesterov Accelerated Gradient
Momentum may be a good method, but if the momentum is too high, the algorithm may overshoot the local minimum and continue to climb up the other side. To resolve this issue, the Nesterov Accelerated Gradient (NAG) algorithm was developed. It is a look-ahead method.


Nesterov Accelerated Gradient
Nesterov Accelerated Gradient is a momentum-based SGD optimizer that "looks ahead" to where the parameters will be in order to calculate the gradient ex post rather than ex ante.

• The gradient used in the update is the projected gradient ∂L/∂w*.
• V is initialised to 0.
• As in SGD with momentum, β is usually set to 0.9.

The projected gradient value can be obtained by going "one step ahead" using the previous velocity. This means that for time step t, we need to carry out another forward propagation before executing the backpropagation.
Nesterov Accelerated Gradient
Steps:
1. Update the current weight wₜ to a projected weight w* using the previous velocity.
2. Carry out forward propagation, but using this projected weight.
3. Obtain the projected gradient ∂L/∂w*.
4. Compute Vₜ and wₜ₊₁ accordingly.
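
The four steps can be sketched as below. The slides do not spell out the exact formulas, so this sketch assumes one common formulation: the projected weight is w* = wₜ − α·Vₜ₋₁, the velocity is Vₜ = β·Vₜ₋₁ + (1 − β)·∂L/∂w*, and the update is wₜ₊₁ = wₜ − α·Vₜ. The toy loss and hyperparameter values are also assumptions.

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
    return np.array([w[0], 25.0 * w[1]])

def nesterov(w0, alpha=0.2, beta=0.9, steps=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                      # V initialised to 0
    for _ in range(steps):
        w_proj = w - alpha * v                # step 1: projected weight w* (assumed form)
        g_proj = grad(w_proj)                 # steps 2-3: gradient evaluated at w*
        v = beta * v + (1.0 - beta) * g_proj  # step 4: new velocity V_t
        w = w - alpha * v                     # step 4: new weight w_{t+1}
    return w

w_nag = nesterov([-10.0, 1.0])
print(w_nag)
```

The only difference from the plain Momentum sketch is where the gradient is evaluated: at the looked-ahead point w* rather than at the current weight.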



Nesterov Accelerated Gradient

The intuition is that the standard momentum method first computes the
gradient at the current location and then takes a big jump in the direction of the
updated accumulated gradient. In contrast Nesterov momentum first makes a
big jump in the direction of the previous accumulated gradient and then
measures the gradient where it ends up and makes a correction. The idea being
that it is better to correct a mistake after you have made it.



AdaGrad (Adaptive Gradient Algorithm)
(Adapts the learning rate to the computed gradient values.)

• Situations might occur during training in which one component of the weight vector has very large gradient values while another has extremely small ones.
• This happens especially when an infrequent model parameter appears to have a low influence on predictions.
• The same problem can occur with sparse data, where there is too little information about certain features.

AdaGrad accumulates the element-wise squares dw² of the gradients from all previous iterations:

$v_t = v_{t-1} + dw_t^2$

During the weight update, instead of using the normal learning rate α, AdaGrad scales it by dividing α by the square root of the accumulated gradients √vₜ:

$w_t = w_{t-1} - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, dw_t$

A small positive term ε is added to the denominator to prevent potential division by zero.



AdaGrad (Adaptive Gradient Algorithm)
Advantage:
The greatest advantage of AdaGrad is that there is no longer a need to manually adjust the learning rate, as it adapts itself during training.

• AdaGrad deals with the aforementioned problem by independently adapting the learning rate for each weight component.
• If the gradients corresponding to a certain weight vector component are large, then the respective learning rate will be small.
• Conversely, for smaller gradients, the learning rate will be bigger. This way, AdaGrad deals with vanishing and exploding gradient problems.


AdaGrad (Adaptive Gradient Algorithm)
Disadvantage:
• The learning rate constantly decays as the number of iterations increases (the learning rate is always divided by a positive cumulative number). Therefore, the algorithm tends to converge slowly during the last iterations, where the learning rate becomes very low.


AdaGrad (Adaptive Gradient Algorithm)

Figure: AdaGrad (white) vs. gradient descent (cyan) on a terrain with a saddle point. The learning rate of AdaGrad is set to be higher than that of gradient descent, but the point that AdaGrad’s path is straighter stays largely true regardless of learning rate. This property allows AdaGrad (and other similar
AdaGrad (Adaptive Gradient Algorithm)
From the animation, it can be seen that AdaGrad might converge more slowly than other methods. This could be because the accumulated gradient in the denominator causes the learning rate to shrink and become very small, thereby slowing down the learning over time.
Issue with a squared gradient for vₜ:
• Transformation equations when using only the last squared gradient at every iteration:

$v_t = dw_t^2$
$w_t = w_{t-1} - \frac{\alpha}{\sqrt{v_t}}\, dw_t = w_{t-1} - \alpha \cdot \mathrm{sign}(dw_t)$

• If dw > 0, then a weight w is decreased by α.
• If dw < 0, then a weight w is increased by α.
• Thus, if vₜ = dw², then model weights can only be changed by ±α.
• Though this approach works sometimes, it is not flexible: the algorithm becomes extremely sensitive to the choice of α, and the absolute magnitudes of the gradients are ignored, which can make the method tremendously slow to converge.
• A small positive aspect of this algorithm is that only a single bit is required to store the sign of each gradient, which can be handy in distributed computations with strict memory requirements.
RMSProp (Root Mean Square Propagation)
RMSProp was developed as an improvement over AdaGrad that tackles the issue of learning rate decay.

• Instead of storing a cumulative sum of squared gradients dw² for vₜ, RMSProp calculates the exponentially moving average of the squared gradients dw²:

$v_t = \beta v_{t-1} + (1 - \beta)\, dw_t^2$

• Experiments show that RMSProp generally converges faster than AdaGrad because, with the exponentially moving average, it puts more emphasis on recent gradient values rather than distributing importance equally between all gradients by simply accumulating them from the first iteration.
• Furthermore, compared to AdaGrad, the learning rate in RMSProp does not always decay as the number of iterations increases, which makes it possible to adapt better in particular situations.
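
The change from AdaGrad is a single line: the running sum becomes an exponentially moving average. The sketch below assumes the same weight update form as AdaGrad (ε outside the square root) and a hypothetical toy loss; names and hyperparameter values are illustrative.

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
    return np.array([w[0], 25.0 * w[1]])

def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def rmsprop(w0, alpha=0.05, beta=0.9, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)  # moving average of squared gradients
    for _ in range(steps):
        dw = grad(w)
        v = beta * v + (1.0 - beta) * dw ** 2    # moving average instead of a running sum
        w = w - alpha * dw / (np.sqrt(v) + eps)  # same scaled step as AdaGrad
    return w

w_rms = rmsprop([-3.0, 1.0])
print(w_rms, loss(w_rms))
```

Because old squared gradients are forgotten, the denominator can shrink again when gradients become small, so the effective learning rate does not decay monotonically as it does in AdaGrad.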



RMSProp (Root Mean Square Propagation)

In RMSProp, it is recommended to choose β close to 1.



RMSProp (Root Mean Square Propagation)

Figure: RMSProp (green) vs AdaGrad (white). The first run just shows the balls; the second run also shows the sum of gradient squared represented by the squares.
Adam (Adaptive Moment Estimation)
• Adam is one of the most widely used optimization algorithms in deep learning.
• Adam combines the Momentum and RMSProp algorithms. To achieve this, it simply keeps track of the exponentially moving averages of the computed gradients and of the squared gradients, respectively.
• Furthermore, it is possible to use bias correction for the moving averages for a more precise approximation of the gradient trend during the first several iterations.
• Experiments show that Adam adapts well to almost any type of neural network architecture, taking the advantages of both Momentum and RMSProp.

$v_t = \beta_1 v_{t-1} + (1 - \beta_1)\, dw_t$ (first moment)
$s_t = \beta_2 s_{t-1} + (1 - \beta_2)\, dw_t^2$ (second moment)
$w_t = w_{t-1} - \alpha \frac{v_t}{\sqrt{s_t} + \epsilon}$ (updated weight)
Adam (Adaptive Moment Estimation)

According to the Adam paper (https://arxiv.org/pdf/1412.6980.pdf), good default values for the hyperparameters are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.


Role the first moment and second moment play in adaptively adjusting the learning rate
First Moment:
• Also known as the mean of the gradients, it is the exponentially decaying average of past gradients for each parameter.
• Imagine it as a "moving average" of how steeply the loss function changes in the direction of each parameter.
• This helps to track the overall trend of the gradient, preventing Adam from being overly affected by sudden spikes or fluctuations.
• Its contribution is to provide a smoother and more stable direction for updating the weights compared to using just the current gradient.


Role the first moment and second moment play in adaptively adjusting the learning rate
Second Moment:
• Also known as the uncentered variance of the gradient (the quantity tracked by RMSProp), it is the exponentially decaying average of squared past gradients for each parameter.
• Think of it as a measure of how large or volatile the recent gradients have been for each parameter.
• If the second moment is high, it indicates large or strongly fluctuating gradients, and Adam reduces the learning rate for that parameter, preventing it from overshooting the minimum loss.
• Conversely, a low second moment indicates small, consistent gradients, and Adam allows a faster learning rate for that parameter.
• The contribution of the second moment is to dynamically adjust the learning rate for each parameter, preventing overshooting and allowing faster convergence in areas with smoother changes.


Steps Involved in the Adam Optimization Algorithm
1. Initialize the moving averages of the first and second moments (v and s) to zero.
2. Compute the gradient of the loss function with respect to the model parameters.
3. Update the moving averages using exponentially decaying averages. This involves calculating vₜ and sₜ as weighted averages of the previous moments and the current gradient (for sₜ, the squared gradient).
4. Apply bias correction to the moving averages, which matters particularly during the early iterations.
5. Calculate the parameter update by dividing the bias-corrected first moment by the square root of the bias-corrected second moment, with a small constant (epsilon) added for numerical stability.
6. Update the model parameters using the calculated updates.
7. Repeat steps 2-6 for a specified number of iterations or until convergence.
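
These steps map almost line by line onto code. The following is a minimal sketch, not a reference implementation: the toy loss, the learning rate α = 0.1, and the function names are assumptions, while the β₁, β₂, and ε defaults follow the values quoted from the Adam paper.

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical loss f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2).
    return np.array([w[0], 25.0 * w[1]])

def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def adam(w0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)  # step 1: first moment
    s = np.zeros_like(w)  # step 1: second moment
    for t in range(1, steps + 1):
        dw = grad(w)                                 # step 2
        v = beta1 * v + (1.0 - beta1) * dw           # step 3: average of gradients
        s = beta2 * s + (1.0 - beta2) * dw ** 2      # step 3: average of squared gradients
        v_hat = v / (1.0 - beta1 ** t)               # step 4: bias correction
        s_hat = s / (1.0 - beta2 ** t)
        w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)  # steps 5-6
    return w

w_first = adam([-3.0, 1.0], steps=1)
w_adam = adam([-3.0, 1.0])
print(w_first)
print(w_adam, loss(w_adam))
```

With bias correction, the very first update moves every weight by approximately α in the direction opposite to its gradient, regardless of the gradient's magnitude.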



Advantage
Adam tends to converge quickly, so it prioritizes a shorter computation time. Algorithms like stochastic gradient descent, in contrast, tend to generalize better, at the cost of slower computation.

Disadvantage
Adam prioritizes computation time over how well the model generalizes to the data.

Note: The optimization algorithm can be picked depending on the requirements and the type of data.


Visualizations of various optimization algorithms.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Summary- Optimizers



Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 1
Loss Function



“Visualizing the loss landscape of neural nets”, Dec 2017.


Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

$W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$


• η is the learning rate used for training the network.
• It has a high impact on the performance of the model.
• How can we set the learning rate?


Setting the Learning Rate
• Setting a smaller learning rate means trusting the gradient less.
• A small learning rate converges slowly and can get stuck in false local minima.

Figure: loss J(W) plotted against W, starting from the initial guess.
Setting the Learning Rate
• Large learning rates overshoot, become unstable, and diverge, which is even less desirable.

Figure: loss J(W) plotted against W, starting from the initial guess.
Setting the Learning Rate
• Setting the learning rate is very challenging.
• Stable learning rates converge smoothly and avoid local minima.
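
The three regimes (too small, too large, stable) can be reproduced on a toy loss. This sketch assumes the hypothetical loss J(W) = W², whose gradient is 2W. Because this loss is convex, it only illustrates slow convergence and divergence, not the local-minimum behaviour mentioned above.

```python
def run_gradient_descent(eta, w=5.0, steps=20):
    """Apply W <- W - eta * dJ/dW for the hypothetical loss J(W) = W**2."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

w_small = run_gradient_descent(eta=0.001)  # barely moves from the initial guess
w_stable = run_gradient_descent(eta=0.3)   # converges smoothly towards 0
w_large = run_gradient_descent(eta=1.1)    # overshoots more on every step and diverges

print(w_small, w_stable, w_large)
```

Each step multiplies W by (1 − 2η), so the behaviour depends on whether this factor is close to 1 (slow), well inside (−1, 1) (stable), or outside that interval (divergent).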

Figure: loss J(W) plotted against W, starting from the initial guess.


How to deal with setting learning rate?

Idea 1:
Trial-and-error method: Try different learning rates and see which one works best.

Idea 2:
Do something smarter!
Design an adaptive learning rate that "adapts" to the loss landscape.



Adaptive Learning Rates

• Learning rates are no longer fixed.
• They can be made larger or smaller depending on:
  • how large the gradient is
  • how fast learning is happening
  • the size of particular weights
  • etc.


Summary
• Loss function: Compares the target and predicted output values to measure how well the neural network models the training data.
• Types of loss function:
  • Regression loss
  • Classification loss
• Learning rate: A hyperparameter that governs the pace at which an algorithm updates or learns the values of a parameter estimate.
• Setting an adaptive learning rate is a better solution than a fixed learning rate.


Adaptive Learning Rates
Algorithms with a TensorFlow implementation:
• Adam
• Adadelta
• Adagrad
• RMSProp


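Adagrad (Adaptive Gradient Descent)
• The learning rate of each parameter depends on the gradients that parameter has received during training: the larger the accumulated squared gradients, the smaller its learning rate becomes. The weight update is:

$$\omega_{t+1} = \omega_t - \eta_t \, \frac{\partial J(\omega_t)}{\partial \omega_t}, \qquad \eta_t = \frac{\eta}{\sqrt{\alpha_t + \epsilon}}, \qquad \alpha_t = \sum_{i=1}^{t} \left( \frac{\partial J(\omega_i)}{\partial \omega_i} \right)^2$$

• $\eta$: constant (base learning rate)
• $\epsilon$: small positive value to avoid division by 0
• $\eta_t$: a different learning rate at each iteration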
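
The Adagrad update can be sketched in a few lines of NumPy. This is a minimal illustration, not the slide author's code: the function name `adagrad_minimize`, the toy objective J(w) = sum(w²), and the default values of `eta`, `eps` and `steps` are assumptions chosen for the example.

```python
import numpy as np

def adagrad_minimize(grad_fn, w0, eta=0.5, eps=1e-8, steps=200):
    """Run Adagrad starting from w0; grad_fn(w) returns dJ/dw at w."""
    w = np.asarray(w0, dtype=float).copy()
    accum = np.zeros_like(w)          # running sum of squared gradients, one entry per weight
    for _ in range(steps):
        g = grad_fn(w)
        accum += g ** 2               # the denominator only ever grows
        w -= eta / np.sqrt(accum + eps) * g   # per-weight effective learning rate
    return w

# Toy objective (assumed for illustration): J(w) = sum(w**2), so dJ/dw = 2w
w_start = np.array([3.0, -2.0])
w_final = adagrad_minimize(lambda w: 2.0 * w, w_start)
```

Because `accum` never shrinks, the effective step size can only decrease, which is the behaviour described as Adagrad's main disadvantage.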



Adagrad (Adaptive Gradient Descent)

• Advantage: It removes the need to tune the learning rate manually, and it often converges faster than gradient descent with a fixed learning rate.

• Disadvantage: It decreases the learning rate aggressively and monotonically. Because the squared gradients in the denominator keep accumulating, the denominator keeps growing and the learning rate can become extremely small. At that point the model effectively stops learning, which compromises its accuracy.



RMSprop (Root Mean Square Propagation)
• RMSprop builds on Rprop, which uses only the sign of the gradient and adapts the step size individually for each weight.
• In Rprop, the signs of the last two gradients are compared. Same sign (going in the right direction): increase the step size by a small fraction. Opposite signs: decrease the step size.
• RMSprop keeps a moving average of the squared gradients for every weight and divides the gradient by the square root of that mean square.

$$\omega_{t+1} = \omega_t - \frac{\eta}{\sqrt{v(\omega, t)}} \, \frac{\partial J(\omega_t)}{\partial \omega_t}$$

$$v(\omega, t+1) = \gamma \, v(\omega, t) + (1 - \gamma) \left( \frac{\partial J(\omega_t)}{\partial \omega_t} \right)^2$$

• $\gamma$: momentum or forgetting factor, usually 0.9
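
A minimal NumPy sketch of the RMSprop update follows. The function name, the toy objective, and the values of `eta` and `steps` are assumptions for illustration. Two details are also assumptions that go beyond the slide: a small `eps` is added to the denominator, and the moving average is updated before the weight step so that the first step does not divide by zero.

```python
import numpy as np

def rmsprop_minimize(grad_fn, w0, eta=0.01, gamma=0.9, eps=1e-8, steps=500):
    """Run RMSprop starting from w0; grad_fn(w) returns dJ/dw at w."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)              # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        v = gamma * v + (1.0 - gamma) * g ** 2   # forget old gradients at rate gamma
        w -= eta / (np.sqrt(v) + eps) * g        # divide by the root mean square
    return w

# Toy objective (assumed for illustration): J(w) = sum(w**2), so dJ/dw = 2w
w_start = np.array([3.0, -2.0])
w_final = rmsprop_minimize(lambda w: 2.0 * w, w_start)
```

Unlike the Adagrad accumulator, `v` can shrink again when recent gradients are small, so the effective learning rate does not decay monotonically.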
RMSprop (Root Mean Square Propagation)

• Advantage: It avoids the monotonic decrease of the learning rate seen in AdaGrad.

• Disadvantage: The base learning rate still has to be chosen manually, and a single suggested value does not work for every application.



Adadelta

• AdaDelta is a stochastic optimization technique that provides a per-dimension learning rate for SGD.

• It is an extension of Adagrad that seeks to reduce Adagrad's aggressive, monotonically decreasing learning rate.

• Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size w.
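
As a rough sketch of the idea: rather than storing the last w gradients, the original Adadelta formulation approximates the window with an exponentially decaying average of squared gradients, and scales each step by a decaying average of past squared updates. The code below assumes that formulation; the function name, the decay `rho`, `eps`, the step count, and the toy objective are illustrative choices, not taken from the slides.

```python
import numpy as np

def adadelta_steps(grad_fn, w0, rho=0.95, eps=1e-6, steps=100):
    """Apply a number of Adadelta updates starting from w0."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(w)    # decaying average of squared gradients
    avg_sq_step = np.zeros_like(w)    # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g ** 2
        step = -np.sqrt(avg_sq_step + eps) / np.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step ** 2
        w += step
    return w

# Toy objective (assumed for illustration): J(w) = sum(w**2), so dJ/dw = 2w
w_start = np.array([3.0, -2.0])
w_final = adadelta_steps(lambda w: 2.0 * w, w_start)
```

Note that no base learning rate appears in the update: the step size is set by the ratio of the two running averages, and it starts out small because `avg_sq_step` is initialized to zero.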

Adam (Adaptive Moment Estimation)

• The Adam optimizer adapts the learning rate for each network weight individually.
• It uses estimates of the first and second moments of the gradient. The first moment is the mean, and the second moment is the uncentered variance (meaning we don't subtract the mean during the variance calculation).

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \frac{\partial J(\omega)}{\partial \omega}, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \frac{\partial J(\omega)}{\partial \omega} \right)^2$$

Bias-corrected estimators for the first and second moments:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

• Because $m_t$ and $v_t$ are initialized to 0, they tend to be biased towards 0, since both β1 and β2 ≈ 1. Adam fixes this by computing bias-corrected $\hat{m}_t$ and $\hat{v}_t$. This also helps control the weight updates near a minimum and prevents large oscillations there.
• Adam typically runs fast, has low memory requirements, and needs less tuning than many other optimization algorithms.
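
The moment estimates and bias correction can be sketched as follows. This is a minimal illustration under stated assumptions: the weight update `w -= eta * m_hat / (sqrt(v_hat) + eps)` is the standard Adam step (it is not written out on the slide), and the function name, toy objective, and hyperparameter values are chosen for the example.

```python
import numpy as np

def adam_minimize(grad_fn, w0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run Adam starting from w0; grad_fn(w) returns dJ/dw at w."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)              # first moment estimate (mean of gradients)
    v = np.zeros_like(w)              # second moment estimate (uncentered variance)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)   # bias correction: m and v start at 0
        v_hat = v / (1.0 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective (assumed for illustration): J(w) = sum(w**2), so dJ/dw = 2w
w_start = np.array([3.0, -2.0])
w_final = adam_minimize(lambda w: 2.0 * w, w_start)
```

Without the two correction lines, the first few steps would be scaled by estimates that are far too close to zero, which is the bias described in the bullet above.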



Visualizations of various optimization algorithms.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 3
Overfitting and underfitting, bias-variance trade-off



• Underfitting: the model is too simplistic and not able to learn enough from the training data

• Hence it reduces the accuracy and produces unreliable predictions.

• How to avoid underfitting?
  • By increasing the training time of the model.
  • By increasing the number of features.

Figure: The model is unable to capture the data points present in the plot.
Source: https://www.javatpoint.com/overfitting-and-underfitting-in-machine-learning
• Reasons for underfitting:
  • Data used for training is not cleaned and contains noise (garbage values)
  • The model has a high bias
  • The size of the training dataset is not enough
  • The model is too simple
• When learning a model we have a set of data (training set)
that we use to learn the model parameters
• The evaluation of the model needs to happen out-of-sample,
i.e., on a different set that was not used for learning model
parameters
• One of the most common problems during training is tying
the model to the training set
– Overfitting



• When a model is overfitted, it is not expected to perform well on new data
– It is not generalizable

• Overfitting occurs when the chosen model is so complex that it ends up describing the noise in the data instead of the trend
– E.g., too many parameters relative to the size of the training dataset
– An overfitted model memorizes the training instances and does not learn the general trend in them

• Reasons for overfitting:
  • Data used for training is not cleaned and contains noise (garbage values)
  • The model has a high variance
  • The size of the training dataset is not enough
  • The model is too complex



• Bias of a model: the underlying assumptions that make learning possible. Simpler model => more assumptions => high bias

• Variance of a model: the variability of the model's predictions for given data points across different training sets. A model with high variance pays a lot of attention to the training data and may end up memorizing it rather than learning from it

• If we want to minimize MSE, we need to minimize both bias and variance
• However, when bias gets smaller, variance typically increases, and vice versa
• A model that is underfitted has high bias
– Misses relevant relations between the independent variables and the response variable
– Bias is reduced by increasing model complexity
• A model that is overfitted has high variance
– The model captures the noise in the training data instead of the trend
– Variance is reduced by decreasing model complexity
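
The link between MSE, bias, and variance can be made explicit with the standard decomposition of the expected squared error. The notation is an assumption introduced here, not taken from the slides: the data are modelled as y = f(x) + noise, where the noise has zero mean and variance σ², and f̂ is the model fitted on a randomly drawn training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why the trade-off is between bias and variance while the noise term sets a floor on the achievable error.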



Trading off goodness of fit against complexity of the model



• The real aim of supervised learning is to do well on test data that is not known during learning
• Choosing the parameter values that minimize the loss function on the training data is not necessarily the best policy
• Generalization refers to how well the model trained on the training data predicts the correct output for new instances
• We want the learning machine to model the true regularities in the data and to ignore the noise in the data
• But the learning machine does not know which regularities are real and which are accidental quirks of the particular set of training examples we happen to pick
• So how can we be sure that the machine will generalize correctly to new data?
Model Selection: Which model is best?

Source: https://www.javatpoint.com/overfitting-in-machine-learning
Techniques to Avoid Overfitting
• Data augmentation
• Regularization
• Dropout
• Early stopping
• Cross-validation



Simple model | Complex model
Has fewer parameters to be learned (low complexity, low capacity) | Has more parameters to be learned (high complexity, high capacity)
May underfit: it may not capture the underlying trend of the data | May overfit: it may start learning from noise and inaccurate data entries
Higher error for training data; may give high error for validation data also | Lower error for training data; may give higher error for validation data
High bias, low variance | Low bias, high variance

Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 4
How to avoid overfitting



Problem of overfitting

Source: https://www.javatpoint.com/overfitting-in-machine-learning
Techniques to Avoid Overfitting
• Data augmentation
• Regularization
• Dropout
• Early stopping
• Cross-validation



Data Augmentation
• Train with more data to avoid overfitting and regularize the model
• Capturing and labeling data is usually expensive
• New data is generated from existing data with the help of:
  • Image rotations
  • Translation
  • Blur, added noise
  • Brightness changes
  • Scaling
  • Flips (up-down, left-right)
  • and so on
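
A few of the transformations listed above can be sketched with plain NumPy. This is an illustrative sketch only: the function name `augment`, the 8×8 random "image", the 2-pixel shift, the brightness factor 1.2, and the noise level 0.05 are all assumptions, and the translation wraps around the border rather than padding.

```python
import numpy as np

def augment(image, rng):
    """Return simple augmented copies of a 2-D grayscale image with values in [0, 1]."""
    flipped_lr = np.fliplr(image)                   # left-right flip
    flipped_ud = np.flipud(image)                   # up-down flip
    rotated = np.rot90(image)                       # 90-degree rotation
    shifted = np.roll(image, shift=2, axis=1)       # translation by 2 pixels (wraps around)
    brighter = np.clip(image * 1.2, 0.0, 1.0)       # brightness change
    noisy = np.clip(image + rng.normal(0.0, 0.05, image.shape), 0.0, 1.0)  # added noise
    return [flipped_lr, flipped_ud, rotated, shifted, brighter, noisy]

rng = np.random.default_rng(0)
image = rng.random((8, 8))          # stand-in for a real training image
augmented = augment(image, rng)
```

Each augmented copy keeps the label of the original image, so one labelled example yields several training examples at no extra labelling cost.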
[Figure sequence: a very deep network with many neurons, the training set, and the effect of regularization. Slide source: Coding Lane]


Types of Regularization
• Ridge (L2) regularization
• Lasso (L1) regularization
• Elastic Net regularization


Ridge (L2) Regularization
It modifies overfitted or underfitted models by adding a penalty equivalent to the sum of the squares of the magnitudes of the coefficients.

$$\text{Cost function} = \text{Loss} + \lambda \sum w^2$$

Here,
Loss = sum of the squared residuals
λ = penalty for the errors
w = slope of the curve/line

For the linear regression line, let's consider two points that are on the line:
Loss = 0 (considering the two points on the line)
λ = 1
w = 1.4
Then, Cost function = 0 + 1 × 1.4² = 1.96

For the ridge regression line, let's assume:
λ = 1
w = 0.7
Then, Cost function = Loss + 1 × 0.7² = 0.63 (which implies a loss term of 0.63 − 0.49 = 0.14)

Comparing the two models with all data points, we can see that the ridge regression line (cost 0.63) fits the data more accurately than the linear regression line (cost 1.96).

Slide source: simplilearn
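
The worked ridge example can be checked numerically. The sketch below assumes the cost has the form "sum of squared residuals plus λ times the squared slope", which matches the slide's numbers (zero loss, penalty weight 1, slope 1.4, cost 1.96); the function and variable names are hypothetical.

```python
def ridge_cost(residuals, slope, lam):
    """Sum of squared residuals plus the L2 penalty lam * slope**2."""
    loss = sum(r ** 2 for r in residuals)
    return loss + lam * slope ** 2

# Linear regression line through the two training points: zero residuals, slope 1.4
linear_cost = ridge_cost(residuals=[0.0, 0.0], slope=1.4, lam=1.0)   # approximately 1.96

# Penalty term alone for the flatter ridge line with slope 0.7
ridge_penalty = ridge_cost(residuals=[], slope=0.7, lam=1.0)         # approximately 0.49
```

The flatter line pays a much smaller penalty, so it can afford a small non-zero loss on the training points and still end up with a lower total cost.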



Lasso (L1) Regularization
It modifies overfitted or underfitted models by adding a penalty equivalent to the sum of the absolute values of the coefficients.

$$\text{Cost function} = \text{Loss} + \lambda \sum |w|$$

Here,
Loss = sum of the squared residuals
λ = penalty for the errors
w = slope of the curve/line

Comparing the two models with all data points, we can see that the Lasso regression line fits the data more accurately than the linear regression line.

[Figure: Lasso regression line and linear regression line; values shown in the figure: 1.4 and 0.8]

Slide source: simplilearn



Elastic Net Regularization
It modifies overfitted or underfitted models by adding a penalty equivalent to the sum of the squares of the magnitudes of the coefficients plus the sum of the absolute values of the coefficients.

It is the combination of Ridge and Lasso regularization.

$$\text{Cost function} = \text{Loss} + \lambda \sum w^2 + \lambda \sum |w|$$

Here,
Loss = sum of the squared residuals
λ = penalty for the errors
w = slope of the curve/line
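
The three penalties differ only in how they measure the size of the coefficients, which a short NumPy sketch makes concrete. The function names and the example weight vector are assumptions for illustration, and a single penalty weight `lam` is used for both Elastic Net terms; practical implementations often weight the two terms separately.

```python
import numpy as np

def l2_penalty(w, lam):
    """Ridge: lam times the sum of squared coefficients."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """Lasso: lam times the sum of absolute coefficients."""
    return lam * np.sum(np.abs(w))

def elastic_net_penalty(w, lam):
    """Elastic Net: the sum of the Ridge and Lasso penalties."""
    return l2_penalty(w, lam) + l1_penalty(w, lam)

w = np.array([1.4, -0.5, 0.0])       # example coefficients (made up)
ridge = l2_penalty(w, lam=1.0)       # 1.96 + 0.25 + 0 = 2.21
lasso = l1_penalty(w, lam=1.0)       # 1.4 + 0.5 + 0 = 1.9
enet = elastic_net_penalty(w, lam=1.0)
```

The squared penalty punishes the large coefficient 1.4 much more than the small one, while the absolute-value penalty grows linearly, which is what tends to push small coefficients all the way to zero under Lasso.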



Ridge | Lasso | Elastic Net
Useful when we have many variables with relatively smaller data samples | Preferred when we are fitting a linear model with fewer variables | Preferred when we do not know whether we want shrinkage or sparsity in the parameter space
Ridge will reduce the impact of features that are not important in predicting output values | Lasso will eliminate many features and reduce overfitting in the linear model | Elastic Net combines feature elimination from Lasso and feature coefficient reduction from the Ridge model to improve the model predictions



Dropout
• During training, some number of nodes are randomly ignored or "dropped out"
• At each weight update, the layer configuration therefore appears "new"
• Provides regularization by discouraging co-adaptation, in which network layers adapt to correct mistakes from prior layers
• Improves generalization of the model
• Useful in wider networks to avoid overfitting
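
Randomly ignoring nodes during training can be sketched as a mask applied to a layer's activations. The sketch below uses the common "inverted dropout" convention, in which the kept activations are scaled up by 1 / keep probability so that nothing needs to change at test time; that scaling, the function name, and the drop probability 0.5 are assumptions not stated on the slide.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return activations                       # no nodes are dropped at test time
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True for nodes that are kept
    return activations * mask / keep_prob        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones((4, 5))                              # stand-in for hidden activations
h_train = dropout(h, drop_prob=0.5, rng=rng)
h_eval = dropout(h, drop_prob=0.5, rng=rng, training=False)
```

A fresh mask is drawn on every call, so each weight update sees a differently thinned layer.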



Early Stopping
• Stop training before we have a chance to overfit
• Number of iterations (epochs) is a hyperparameter
• Too few epochs => suboptimal solution (underfit)
• Too many epochs => overfitting

[Figure: loss versus training iterations for the training and testing curves, with under-fitting and over-fitting regions marked. "Stop training here!" is marked between the two regions.]
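
One common way to decide when to stop is to monitor the validation loss and stop once it has failed to improve for a fixed number of epochs. The sketch below assumes that rule; the function name, the `patience` parameter, and the list of validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, scanning until patience runs out."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: remember this epoch
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` epochs
    return best_epoch

# Made-up validation losses: they fall, then rise as the model starts to overfit
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_at = early_stopping_epoch(val_losses)
```

In a real training loop the model weights from the best epoch would be saved and restored, so that the extra `patience` epochs do not degrade the final model.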



• When data is plentiful, set aside a part of the training data as validation data and use it to perform model selection
• Declare the final result on the test data
• Typical ratio for splitting into training, validation, and test data: 60:20:20
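
A 60:20:20 split can be sketched by shuffling the sample indices and slicing them. The function name and the sample count of 100 are assumptions for the example.

```python
import numpy as np

def train_val_test_split(n_samples, rng, fractions=(0.6, 0.2, 0.2)):
    """Shuffle sample indices and split them into train, validation and test parts."""
    indices = rng.permutation(n_samples)
    n_train = int(fractions[0] * n_samples)
    n_val = int(fractions[1] * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]             # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = train_val_test_split(100, np.random.default_rng(0))
```

The validation part is used to compare models and hyperparameters; the test part is touched only once, to report the final result.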



• K-fold cross-validation
  • When data is not sufficient, split the data into k segments, train with (k-1) segments, validate with 1 segment, and iterate
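
The "split, train on k-1, validate on 1, iterate" loop can be sketched as an index generator. The function name and the choice of 10 samples and 5 folds are assumptions for the example; shuffling is omitted for brevity.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val = folds[i]                                           # one segment for validation
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # the other k-1
        yield train, val

splits = list(k_fold_indices(n_samples=10, k=5))
```

Every sample is used for validation exactly once, and the k validation scores are typically averaged to estimate how well the model generalizes.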



Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 5
Batch Normalization



Normalization | Batch Normalization
Normalization is a procedure to change the values of numeric variables in the dataset to a common scale, without distorting differences in the ranges of values. | Batch normalization is a technique for training very deep neural networks that normalizes the inputs to a layer for every mini-batch. This has the effect of stabilizing the learning process and drastically decreasing the number of training epochs required to train deep neural networks.

• Normalization is a data pre-processing tool used to bring numerical data to a common scale without distorting its shape, to ensure that our model can generalize appropriately.

• Batch normalization is a process that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer.

• The normalizing process in batch normalization takes place in batches, not on a single input.
Initially, our inputs X1, X2, X3, X4 are in normalized form, as they are coming from the pre-processing stage. When the input passes through the first layer, it is transformed: a sigmoid function is applied over the dot product of the input X and the weight matrix W.

Similarly, this transformation takes place for the second layer and continues up to the last layer L.

Although our input X was normalized, over time the output will no longer be on the same scale. As the data goes through multiple layers of the neural network and L activation functions are applied, this leads to an internal covariate shift in the data.
• Internal covariate shift is the change in the distribution of network activations due to the change in network parameters during training

Source: https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
If we stabilize the input values for each layer (defined as z = Wx + b, where z is the linear transformation of the input x by the weights/parameters W and the biases b), we can prevent our activation function from pushing our input values into the maximum/minimum values of the activation function.

Fig. From the gradient it can be observed that for larger z the gradient approaches zero. When the network's nodes exist in this space, training slows down significantly, since gradient values decrease.



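• Goal: transform the data to have mean zero and standard deviation one
• Calculate the mean and standard deviation of the hidden layer activations:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} h_i, \qquad \sigma = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h_i - \mu)^2}$$

Here m is labelled on the slide as the number of neurons at layer h. (In batch normalization as usually implemented, these statistics are computed for each neuron over the examples in the mini-batch.)

• Normalize the hidden activations by subtracting the mean from each input and dividing the result by the sum of the standard deviation and the smoothing term (ε):

$$\hat{h}_i = \frac{h_i - \mu}{\sigma + \epsilon}$$

• γ (gamma) and β (beta) are learnable parameters used for re-scaling (γ) and shifting (β) the normalized values:

$$h_i \leftarrow \gamma \, \hat{h}_i + \beta$$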
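
The steps above can be sketched as a forward pass in NumPy. This is a training-time sketch under stated assumptions: statistics are taken over the mini-batch axis for each neuron, the denominator is `sigma + eps` as described in the text (many implementations use `sqrt(variance + eps)` instead), and the running averages used at inference time are omitted. The function name and the example shapes are illustrative.

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Normalize activations h of shape (batch_size, n_neurons), then scale and shift."""
    mu = h.mean(axis=0)                  # per-neuron mean over the mini-batch
    sigma = h.std(axis=0)                # per-neuron standard deviation over the mini-batch
    h_norm = (h - mu) / (sigma + eps)    # roughly zero mean and unit standard deviation
    return gamma * h_norm + beta         # learnable re-scaling and shifting

rng = np.random.default_rng(0)
h = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # made-up activations, far from normalized
out = batch_norm_forward(h, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to ones and `beta` to zeros the layer simply standardizes its input; during training these two vectors are learned, so the network can recover a different scale or offset where that helps.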



Source: https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
• Speeds up training
By normalizing the hidden layer activations, batch normalization speeds up the training process.

• Handles internal covariate shift
It reduces the problem of internal covariate shift, so that the input to every layer is distributed around the same mean and standard deviation.

• Smooths the loss function
Batch normalization smooths the loss function, which makes the model parameters easier to optimize and in turn improves the training speed of the model.



Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 6
Hyperparameter tuning



• Hyperparameters are the parameters that are explicitly defined by the user to control the learning process
• They are used to calculate the model parameters, are specific to the algorithm, and, unlike parameters, cannot be calculated from the data
• They are selected and set before the learning algorithm begins training the model. Hence, they are external to the model, and their values are not learned during the training process.



• The k in the kNN (K-Nearest Neighbour) algorithm
• Learning rate for training a neural network
• Number of layers
• Number of nodes per layer
• Momentum
• Train-test split ratio
• Batch size
• Number of epochs
• Number of clusters in a clustering algorithm
Model Parameter | Model Hyperparameter
They are used by the model for making predictions. | These are usually defined manually by the machine learning engineer.
They are learned by the model from the data itself. | One cannot know the exact best value of a hyperparameter for the given problem. The best value can be determined by trial and error.
These are usually not set manually. |
These are part of the model and key to a machine learning algorithm. |



• Hyperparameter for Optimization
• Learning Rate
• Batch Size

• Hyperparameter for Specific Models


• Number of hidden units
• Number of layers
• Hyperparameter tuning consists of finding a set of optimal hyperparameter values for a learning algorithm, and then applying this optimized algorithm to a data set

• It maximizes the model's performance by minimizing a predefined loss function, producing better results with fewer errors.



• Some important hyperparameters that require tuning in neural
networks are:
• Number of hidden layers
• Number of nodes/neurons per layer
• Learning rate
• Momentum



• Hyperparameters can be tuned either manually or automatically.
• Some automated hyperparameter tuning methods include:
• Grid search,
• Random search,
• Bayesian optimization.



• Grid search is a sort of "brute force" hyperparameter tuning method. A grid of possible discrete hyperparameter values is defined, and the model is fit with every possible combination. The model performance for each combination is recorded, and the combination that produced the best performance is selected.
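
Grid search amounts to a loop over the Cartesian product of the candidate values. In the sketch below, `validation_loss` is a hypothetical stand-in for "train the model with these hyperparameters and measure its validation loss", and the grid values are made up for the example.

```python
from itertools import product

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and measuring its validation loss."""
    return (learning_rate - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-6

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}

results = []
for lr, bs in product(grid["learning_rate"], grid["batch_size"]):
    results.append(((lr, bs), validation_loss(lr, bs)))   # record every combination

best_params, best_loss = min(results, key=lambda item: item[1])
```

The number of combinations is the product of the list lengths (here 3 × 3 = 9), so the cost grows quickly as more hyperparameters or more candidate values are added.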



• Random search chooses random values rather than using a predefined set of values like the grid search method.

• It tries a random combination of hyperparameters in each iteration and records the model performance. After several iterations, it returns the combination that produced the best result.
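
Random search replaces the fixed grid with sampling. As before, `validation_loss` is a hypothetical stand-in for training and validating a model; the sampling ranges (a log-uniform learning rate between 1e-4 and 1e-1 and a small set of batch sizes) are assumptions for the example.

```python
import random

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and measuring its validation loss."""
    return (learning_rate - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-6

def random_search(n_trials, seed=0):
    """Sample n_trials random combinations and return the best one found."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, -1)               # log-uniform learning rate
        bs = rng.choice([16, 32, 64, 128, 256])      # random batch size
        loss = validation_loss(lr, bs)
        if loss < best_loss:
            best_params, best_loss = (lr, bs), loss
    return best_params, best_loss

best_params, best_loss = random_search(n_trials=50)
```

The number of trials is chosen directly, independent of how many hyperparameters there are, and continuous values such as the learning rate are not restricted to a few preset points.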



[Figure panels: Grid Search. Random Search.]

Grid and random search often evaluate many unsuitable hyperparameter combinations.
• Bayesian optimization treats the search for the optimal hyperparameters as an optimization problem.
• When choosing the next hyperparameter combination, it considers the previous evaluation results and then applies a probabilistic function to select the combination that will probably yield the best results



• Hyperparameters are the parameters that
are explicitly defined to control the
learning process before applying to a
learning algorithm.
• These are used to specify the learning
capacity and complexity of the model.
• Some of the hyperparameters are used for
the optimization of the models, such as
Batch size, learning rate, etc., and some are
specific to the models, such as Number of
Hidden layers, etc.
