• Batch Normalization
• Normalization vs. Standardization
• Covariate Shift
Why Normalization
Bias Variance Tradeoff
Bias and variance typically trade off in relation to model complexity
The total error of the model is minimized by using the bias-variance tradeoff:
the best fit is given by the hypothesis at the tradeoff point, where the combined contribution of bias and variance is smallest.
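The tradeoff can be made concrete with a small experiment. The sketch below (illustrative, not from the slides) repeatedly fits polynomials of increasing degree to noisy samples of a known function and estimates bias² and variance of the fits: low-degree models show high bias, high-degree models show high variance, and an intermediate degree minimizes the total.

```python
# Illustrative sketch: estimating bias^2 and variance for models of
# increasing complexity on a toy 1-D regression problem.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
true_f = np.sin(2 * np.pi * x)          # ground-truth function

def fit_predict(degree):
    """Fit one polynomial model on a freshly sampled noisy dataset."""
    y = true_f + rng.normal(0, 0.3, size=x.shape)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x)

for degree in [1, 3, 9]:                 # low, moderate, high complexity
    preds = np.stack([fit_predict(degree) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree}: bias^2={bias2:.3f}, variance={variance:.3f}, "
          f"total={bias2 + variance:.3f}")
```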
One possible strategy: take the average of the last several values.
This might work in certain cases, but it is not very suitable for scenarios where a parameter depends more on the most recent values.
A better option is the exponentially weighted average, which is based on the assumption that more recent values of a variable contribute more to the formation of the next value than earlier values.
• Taking β = 0.9 indicates that in approximately t = 10 iterations the weight of an observation decays to 1/e of the weight of the current observation.
• In other words, the exponentially weighted average mostly depends only on the last t = 10 observations.
In the equation for the exponential moving average, the observation made t steps ago is multiplied by a term βᵗ. Comparing both forms, and making the substitution β = 1 − x in the famous second remarkable limit lim_{x→0}(1 − x)^{1/x} = 1/e, gives β^{1/(1−β)} ≈ 1/e; for β = 0.9 this means the weights decay by a factor of about 1/e every 1/(1 − β) = 10 steps.
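A minimal sketch of the exponentially weighted average, assuming β = 0.9: it applies the recursive form vₜ = β·vₜ₋₁ + (1 − β)·θₜ and confirms that the weight of an observation 10 steps back has decayed to roughly 1/e.

```python
# Minimal sketch of an exponentially weighted (moving) average with beta = 0.9,
# so the effective window is roughly 1 / (1 - beta) = 10 observations.
import numpy as np

beta = 0.9
observations = np.random.default_rng(1).normal(size=100)

v = 0.0
ewa = []
for theta in observations:
    v = beta * v + (1 - beta) * theta   # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    ewa.append(v)

print(beta ** 10)   # ~0.349, close to 1/e: weight of an observation 10 steps back
```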
• In this example, the starting point and the local minimum have different horizontal coordinates but almost equal vertical coordinates.
• Using gradient descent to find the local minimum will likely make the loss oscillate along the vertical axis while progressing only slowly in the horizontal direction.
• These bounces occur because gradient descent does not store any history of its previous gradients, making the gradient steps more erratic on each iteration.
• Thus, a large learning rate can cause divergence.
Setting the Learning Rate
Gradient Descent
Momentum
It would be desirable for the optimizer to take larger steps in the horizontal direction and smaller steps in the vertical direction.
Momentum uses a pair of equations at each iteration:
v = β·v + (1 − β)·dW   (exponentially moving average of the gradient values dW)
W = W − α·v   (normal gradient-descent update using the computed moving-average value on the current iteration)
The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
(An overview of gradient descent optimization algorithms, Sebastian Ruder)
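A minimal sketch of the momentum update described above (names and the toy loss are illustrative assumptions): v keeps an exponentially weighted average of the gradients dW, and W is updated with that average.

```python
# Minimal sketch of the momentum update on a toy quadratic loss ||W||^2.
import numpy as np

def momentum_step(W, dW, v, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * dW      # moving average of gradient values
    W = W - lr * v                      # gradient-descent update using the average
    return W, v

W = np.array([1.0, -2.0])
v = np.zeros_like(W)                    # v initialised to 0
for _ in range(500):
    dW = 2 * W                          # gradient of the toy loss ||W||^2
    W, v = momentum_step(W, dW, v)
print(W)                                # approaches the minimum at [0, 0]
```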
Momentum
Instead of simply using the raw gradients for updating the weights, we take several past values and literally perform the update in the averaged direction.
Momentum usually converges much faster than gradient descent. With Momentum there are also fewer risks in using larger learning rates, thus accelerating the training process.
(Figure: optimization with Momentum.)
In Momentum, it is recommended to choose β close to 0.9.
Nesterov Momentum
(Figure: the projected gradient step; v is initialised to 0.)
The intuition is that the standard momentum method first computes the
gradient at the current location and then takes a big jump in the direction of the
updated accumulated gradient. In contrast Nesterov momentum first makes a
big jump in the direction of the previous accumulated gradient and then
measures the gradient where it ends up and makes a correction. The idea being
that it is better to correct a mistake after you have made it.
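The sketch below contrasts the Nesterov "look-ahead" step with the classical update (a minimal illustration under the same toy-loss assumption as before, not a definitive implementation): the gradient is measured at the position reached after the previous accumulated step, and the correction is applied there.

```python
# Minimal sketch of Nesterov momentum: jump first, then measure the gradient.
import numpy as np

def grad(W):
    return 2 * W                        # gradient of a toy quadratic loss ||W||^2

W = np.array([1.0, -2.0])
v = np.zeros_like(W)                    # v initialised to 0
lr, beta = 0.01, 0.9
for _ in range(100):
    lookahead = W - lr * beta * v       # first make the "big jump"
    v = beta * v + grad(lookahead)      # then measure the gradient there (correction)
    W = W - lr * v
print(W)                                # close to the minimum at [0, 0]
```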
AdaGrad (white) vs. gradient descent (cyan) on a terrain with a saddle point. The learning rate of AdaGrad is set higher than that of gradient descent, but the point that AdaGrad's path is straighter stays largely true regardless of learning rate. This property allows AdaGrad (and other similar adaptive methods) to escape the saddle point much better.
AdaGrad (Adaptive Gradient Algorithm)
From the animation, it can be seen that AdaGrad might converge more slowly than other methods. This could be because the accumulated gradient in the denominator causes the learning rate to shrink and become very small, thereby slowing down the learning over time.
v = v + dW²   (the last squared gradient is accumulated at every iteration)
W = W − α·dW / (√v + ε)
• A small positive aspect of this algorithm is that only a single bit is required to store the signs of the gradients, which can be handy in distributed computations with strict memory requirements.
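A minimal sketch of the AdaGrad update on the same toy loss (illustrative values): the squared gradients are accumulated in the denominator, so the effective learning rate keeps shrinking over time.

```python
# Minimal sketch of AdaGrad: per-parameter learning rates that only ever shrink.
import numpy as np

def adagrad_step(W, dW, G, lr=0.5, eps=1e-8):
    G = G + dW ** 2                       # accumulated sum of squared gradients
    W = W - lr * dW / (np.sqrt(G) + eps)  # per-parameter (adaptive) learning rate
    return W, G

W = np.array([1.0, -2.0])
G = np.zeros_like(W)
for _ in range(200):
    dW = 2 * W
    W, G = adagrad_step(W, dW, G)
print(W)                                  # the steps shrink as G grows
```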
RMSProp (Root Mean Square Propagation)
RMSProp was developed as an improvement over AdaGrad that tackles the issue of learning-rate decay.
• Instead of storing a cumulative sum of squared gradients dW² in vₜ, an exponentially moving average of the squared gradients dW² is calculated.
RMSProp (green) vs AdaGrad (white). The first run just shows the balls; the second run also shows the
sum of gradient squared represented by the squares.
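A minimal sketch of the RMSProp update under the same toy-loss assumption: like AdaGrad, but the squared gradients are kept as an exponentially moving average, so the learning rate does not decay forever.

```python
# Minimal sketch of RMSProp: EWA of squared gradients in the denominator.
import numpy as np

def rmsprop_step(W, dW, v, lr=0.01, gamma=0.9, eps=1e-8):
    v = gamma * v + (1 - gamma) * dW ** 2   # EWA of squared gradients
    W = W - lr * dW / (np.sqrt(v) + eps)
    return W, v

W = np.array([1.0, -2.0])
v = np.zeros_like(W)
for _ in range(200):
    dW = 2 * W
    W, v = rmsprop_step(W, dW, v)
print(W)                                    # near the minimum at [0, 0]
```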
Adam (Adaptive Moment Estimation)
• Adam is the most famous optimization algorithm in deep learning.
• Adam combines the Momentum and RMSProp algorithms. To achieve this, it simply keeps track of the exponentially moving averages of the computed gradients and of the squared gradients, respectively.
• Furthermore, it is possible to use bias correction for the moving averages for a more precise approximation of the gradient trend during the first several iterations.
• Experiments show that Adam adapts well to almost any type of neural network architecture, taking advantage of both Momentum and RMSProp.
m = β₁·m + (1 − β₁)·dW   (first momentum)
v = β₂·v + (1 − β₂)·dW²   (second momentum)
W = W − α·m̂ / (√v̂ + ε)   (updated weight, using the bias-corrected m̂ and v̂)
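A minimal sketch of the Adam update under the same toy-loss assumption, combining the first moment (Momentum-style), the second moment (RMSProp-style), and the bias correction of both.

```python
# Minimal sketch of Adam: bias-corrected first and second moments.
import numpy as np

def adam_step(W, dW, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * dW            # first momentum: EWA of gradients
    v = b2 * v + (1 - b2) * dW ** 2       # second momentum: EWA of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)   # updated weight
    return W, m, v

W = np.array([1.0, -2.0])
m = np.zeros_like(W)
v = np.zeros_like(W)
for t in range(1, 201):
    dW = 2 * W
    W, m, v = adam_step(W, dW, m, v, t)
print(W)                                  # approaches the minimum at [0, 0]
```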
Adam (Adaptive Moment Estimation)
Disadvantage
It doesn't focus on the data points; rather, it focuses on computation time.
Note: the optimization algorithm can be picked accordingly, depending on the requirements and the type of data.
Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Summary: Optimizers
Remember: optimization through gradient descent
W ← W − η · ∂J(W)/∂W
(Figure: loss J(W) versus weight W, starting from an initial guess.)
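A minimal sketch of this basic update on a toy loss (an illustrative assumption, J(W) = W²): repeat W ← W − η·∂J/∂W from an initial guess.

```python
# Minimal sketch of plain gradient descent on J(W) = W^2.
eta = 0.1
W = 5.0                      # initial guess
for _ in range(50):
    grad = 2 * W             # dJ/dW for J(W) = W^2
    W = W - eta * grad
print(W)                     # converges towards the minimum at W = 0
```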
Setting the Learning Rate
• Large learning rates overshoot, become unstable, and diverge, which is undesirable.
Setting the Learning Rate
• Setting the learning rate is very challenging.
• Stable learning rates converge smoothly and avoid local minima.
Setting the Learning Rate
Idea 1:
Trial-and-error method: try different learning rates and see what works.
Idea 2:
Do something smarter!
Design an adaptive learning rate that "adapts" to the landscape.
v(w, t+1) = γ · v(w, t) + (1 − γ) · (∂J(ω)/∂ω)²
W(t+1) = W(t) − (α / (√v(w, t+1) + ε)) · ∂J(ω)/∂ω
Here γ is the momentum or forgetting factor, usually 0.9; α is a constant learning rate; and ε is a small positive term to avoid division by 0, so each weight effectively receives a different learning rate at each iteration.
RMSProp (Root Mean Square Propagation)
• Advantage: it avoids the monotonic decrease of the learning rate that occurs in AdaGrad.
• Since mₜ and vₜ are initialized to 0, they tend to be 'biased towards 0', especially as both β₁ and β₂ ≈ 1. Adam fixes this problem by computing 'bias-corrected' mₜ and vₜ. This controls the weights while reaching the global minimum, preventing high oscillations near it.
• The algorithm has a faster running time, low memory requirements, and requires less tuning than other optimization algorithms.
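A small numeric sketch of why the bias correction matters in the first iterations (assuming β₁ = 0.9 and a constant gradient of 1.0 purely for illustration): the raw moment m starts biased towards 0, while the corrected estimate recovers the true average immediately.

```python
# Bias correction of the first moment during the first few iterations.
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0       # raw EWA, biased towards 0 at the start
    m_hat = m / (1 - beta1 ** t)            # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))  # m: 0.1, 0.19, ...; m_hat: 1.0 each step
```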
Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Deep Learning
• Course Code:
• Unit 2: Model Parameter Optimization
• Lecture 3: Overfitting and Underfitting, Bias-Variance Trade-off
Source: https://www.javatpoint.com/overfitting-in-machine-learning
Techniques to Avoid Overfitting
• Data Augmentation
• Regularization
• Drop-out
• Early-stopping
• Cross validation
Source: https://www.javatpoint.com/overfitting-in-machine-learning
(Figure: a network with many neurons fitting the training set. Slide source: Coding Lane)
Regularization
Ridge (L2) regression adds a penalty to the usual cost:
Cost function = Sum of the squared residuals + λ · (slope)²
• For the linear regression line, consider two points that lie exactly on the line: the sum of the squared residuals is 0, but the steep slope gives a large penalty, so Cost function = 1.96.
• For the ridge regression line, assume λ = 1 and slope = 0.7: the penalty is only 0.49, and with the small residuals, Cost function = 0.63.
Here the cost function consists of the sum of the squared residuals, the penalty for the errors (λ · slope²), and the slope of the curve/line.
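A minimal sketch of the ridge cost computation; the data points and line parameters below are illustrative assumptions, not the exact values from the slide, but they show the same effect: a steep slope with zero residuals can still cost more than a flatter line with small residuals.

```python
# Minimal sketch of the ridge cost: sum of squared residuals + lambda * slope^2.
import numpy as np

def ridge_cost(x, y, slope, intercept, lam):
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2) + lam * slope ** 2

x = np.array([0.5, 1.0])    # illustrative data points
y = np.array([1.0, 1.7])
print(ridge_cost(x, y, slope=1.4, intercept=0.3, lam=1.0))  # 1.96: zero residuals, large penalty
print(ridge_cost(x, y, slope=0.7, intercept=0.6, lam=1.0))  # ~0.65: small residuals, small penalty
```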
Early Stopping
(Figure: training and testing loss versus training iterations; the under-fitting and over-fitting regions are marked, with the label "Stop training here!" where the testing loss starts to rise.)
• Stop training before we have a chance to overfit.
• The number of iterations (epochs) is a hyperparameter.
• Too few epochs ⇒ suboptimal solution (underfit).
• Too many epochs ⇒ overfitting.
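A minimal sketch of the early-stopping rule; the loss curve below is simulated (an assumption for illustration) so the snippet is self-contained: training stops once the testing loss has not improved for a fixed number of epochs.

```python
# Minimal sketch of early stopping on a simulated testing-loss curve that
# decreases until ~epoch 30 and then rises again (overfitting region).
patience = 5
best_loss, best_epoch, epochs_without_improvement = float("inf"), 0, 0

for epoch in range(100):
    val_loss = (epoch - 30) ** 2 / 1000 + 0.2       # toy U-shaped testing loss
    if val_loss < best_loss:
        best_loss, best_epoch = val_loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                                   # stop training here!

print(f"stopped at epoch {epoch}, best epoch was {best_epoch}")
```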
Source: https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
Batch Normalization
If we stabilize the input values for each layer (defined as z = Wx + b, where z is the linear transformation of the weights/parameters W and the biases b), we can prevent our activation function from pushing the input values into the maximum/minimum regions of the activation function.
(Fig.) From the gradient it can be observed that for larger z the function approaches zero. When the network's nodes exist in this space, training slows down significantly, since the gradient values decrease.
• Normalize the hidden activations by subtracting the mean from each input and dividing by the sum of the standard deviation and the smoothing term ε:
μ = (1/n) Σᵢ zᵢ,   σ = √((1/n) Σᵢ (zᵢ − μ)²)   (n = no. of neurons at layer h)
z_norm(i) = (zᵢ − μ) / (σ + ε)
• γ (gamma) and β (beta): these parameters are used for re-scaling (γ) and shifting (β) of the vector containing the values from the previous operations: z̃ᵢ = γ · z_norm(i) + β.
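A minimal sketch of this normalization for a batch of hidden activations, following the description above (subtract the mean, divide by σ + ε, then re-scale with γ and shift with β); the batch shape and parameter values are illustrative.

```python
# Minimal sketch of batch normalization of the activations z of one layer.
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                  # mean over the mini-batch
    sigma = z.std(axis=0)                # standard deviation over the mini-batch
    z_norm = (z - mu) / (sigma + eps)    # normalized activations
    return gamma * z_norm + beta         # re-scaled and shifted output

z = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))   # batch of 32, 4 neurons
out = batch_norm(z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))    # ~0 mean, ~1 std
```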