
Deep learning

Loss Function, Gradient Descent Algorithm, Backpropagation
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Loss Function
• Compares the target and predicted output values to measure how well the neural
network models the training data.
• The aim is to minimize this loss between the predicted and target outputs.
• There are two major types of loss function:
  • Regression loss: MSE (mean squared error), MAE (mean absolute error)
  • Classification loss: binary cross-entropy, categorical cross-entropy
Note (loss function vs. cost function): the loss function is the loss for a single training
example/input, whereas the cost function is the average loss over the entire training dataset.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mean Squared Error (MSE)

• MSE finds the average of the squared differences between
the target and predicted outputs.
• The difference is squared, so it does not matter whether the
predicted value is above or below the target value; however,
values with a large error are penalized more heavily.
• MSE is also a convex function with a clearly defined
global minimum.
• This makes it easier to use gradient descent
optimization to set the weight values.
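For comparison with the MAE snippet on the next slide, a minimal sketch of MSE in Keras (the y_true / y_pred tensors here are made-up values for illustration):

import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0, 4.0])   # hypothetical targets
y_pred = tf.constant([1.1, 1.9, 3.5, 3.0])   # hypothetical predictions

mse = tf.keras.losses.MeanSquaredError()
loss = mse(y_true, y_pred)                   # mean of the squared differences
print(float(loss))                           # (0.01 + 0.01 + 0.25 + 1.0) / 4 = 0.3175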

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mean Absolute Error (MAE)
• MAE finds the average of the absolute
differences between the target and the
predicted outputs.
• MSE is highly sensitive to outliers, which can dramatically
affect the loss because the distance is squared. MAE is
therefore used when the training data has a large number
of outliers, to mitigate this.
mae = tf.keras.losses.MeanAbsoluteError()
mae(y_true, y_pred)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Binary cross-entropy/Log Loss
• Binary cross-entropy compares each
of the predicted probabilities to the
actual class output, which can be
either 0 or 1.
• It then calculates a score that penalizes the probabilities
based on their distance from the expected value, i.e., how
close or far they are from the actual value (predictions close
to the true class incur a low penalty, distant ones a high penalty).
• Advantage: the cost function is differentiable.
• Disadvantage: multiple local minima; not intuitive.
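A hedged illustration in Keras (the label and probability values are made up):

import tensorflow as tf

y_true = tf.constant([0.0, 1.0, 1.0, 0.0])   # actual class outputs (0 or 1)
y_pred = tf.constant([0.1, 0.8, 0.6, 0.3])   # predicted probabilities

bce = tf.keras.losses.BinaryCrossentropy()
loss = bce(y_true, y_pred)                   # confident wrong predictions are penalized most
print(float(loss))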

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Categorical cross-entropy
• Also called softmax loss: a softmax
activation followed by a cross-entropy loss.
• It is used for multi-class classification.
• In the specific (and usual) case of
multi-class classification, the labels are
one-hot encoded.
• Sparse categorical cross-entropy loss:
  • Used when the number of classes is
    very large (e.g., 1000).
  • Avoids one-hot encoding, which
    requires a large amount of memory.
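A minimal sketch of both variants in Keras (the label and probability values are made up); the sparse version takes integer class indices instead of one-hot vectors:

import tensorflow as tf

y_pred = tf.constant([[0.1, 0.8, 0.1], [0.2, 0.3, 0.5]])   # predicted class probabilities

# One-hot labels with categorical cross-entropy
y_true_onehot = tf.constant([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
cce = tf.keras.losses.CategoricalCrossentropy()
print(float(cce(y_true_onehot, y_pred)))

# Integer labels with sparse categorical cross-entropy (no one-hot encoding needed)
y_true_int = tf.constant([1, 2])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(float(scce(y_true_int, y_pred)))                     # same value as above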

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• We want to find the network weights that achieve the lowest loss; those
weights can then be used for prediction.

W* = argmin (1/n) Σᵢ₌₁ⁿ L(f(xᵢ; W), sᵢ)
        W
W* = argmin J(W)
        W

• Here W is the set of all weights; we need to find the optimal set of weights
that minimizes the average loss J(W) over the entire training set.
• The test set is the separate data set on which we want to evaluate our model.
• argmin (argument of the minimum) picks out the weights W at which the
objective J(W) is smallest.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
w
Remember: our loss function is just a
simple function of those weights.
If we plot the loss landscape for two weights:
• The weights are on the x and y axes, and the
loss is on the z axis.
• For any value of W, we can read off the loss
at that point.
• We need to find the point on this
landscape, i.e., the values of W,
that give the minimum loss.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
w
• Randomly pick a place on this
landscape as a starting point for finding the
minimum.
• From this random starting point, we examine how the
landscape changes, i.e., how its slope changes,
using the gradient of the loss
with respect to each of the weights.
• The gradient is a vector that gives
the direction in which the loss
function has the steepest ascent.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
        w
• The gradient tells us which way to move on the
loss landscape; it is computed as
∂J(W)/∂W.
• Here the landscape is higher around the
selected point, so we need to take a
step in a direction that is lower than
the selected point.
• We take the gradient of the loss
with respect to each of these weights
to understand the direction of
maximum ascent.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
w

• Take a small step in the opposite direction
of the gradient.
• On reaching the lower point, the
process is repeated over and
over again until we converge
to a local minimum.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent
• Repeat until Convergence

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent
Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)          weights = tf.random_normal( )
2. Loop until convergence:
3.   Compute the gradient ∂J(W)/∂W                     grads = tf.gradients(loss, weights)
4.   Update the weights W ← W − η ∂J(W)/∂W             weights_new = weights.assign(weights - lr * grads)
5. Return the weights

To summarize the algorithm known as gradient descent (taking a gradient and descending down the
landscape): we initialize the weights randomly, compute the gradient of J with respect to all of the weights,
and then update the weights in the opposite direction of that gradient, taking a small step scaled by η.
η (eta) is referred to as the learning rate; it is a scalar that indicates how large a step to take at each
iteration, i.e., how strongly or aggressively to step along that gradient.
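The snippets above are TensorFlow 1-style pseudocode. As a hedged, runnable sketch of the same loop in TensorFlow 2, on a hypothetical one-parameter quadratic loss with its minimum at w = 3:

import tensorflow as tf

w = tf.Variable(tf.random.normal([]))      # 1. initialize the weight randomly ~ N(0, sigma^2)
lr = 0.1                                   # learning rate (eta)

for step in range(100):                    # 2. loop (here: a fixed number of steps)
    with tf.GradientTape() as tape:
        loss = (w - 3.0) ** 2              #    hypothetical loss J(w)
    grad = tape.gradient(loss, w)          # 3. compute dJ/dw
    w.assign_sub(lr * grad)                # 4. update: w <- w - eta * dJ/dw

print(w.numpy())                           # 5. return the weights; w converges near 3.0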

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent
Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)          weights = tf.random_normal( )
2. Loop until convergence:
3.   Compute the gradient ∂J(W)/∂W                     grads = tf.gradients(loss, weights)
4.   Update the weights W ← W − η ∂J(W)/∂W             weights_new = weights.assign(weights - lr * grads)
5. Return the weights

• The amount by which the weights are updated during training is referred to as the step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural networks; it has a small positive
value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model adapts to the problem.
• The difficult part is computing that gradient: given a loss and all of the weights in our network, how do we know
which direction is a good way to move? That is done by a process called backpropagation, which we will discuss
using elementary calculus.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent

Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Compute the gradient ∂J(W)/∂W        ← can be very computationally intensive to compute!
4.   Update the weights W ← W − η ∂J(W)/∂W
5. Return the weights

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stochastic Gradient Descent

Algorithm for stochastic gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick a single data point i
4.   Compute the gradient ∂Jᵢ(W)/∂W        ← easy to compute, but very noisy (stochastic)!
5.   Update the weights W ← W − η ∂Jᵢ(W)/∂W
6. Return the weights

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stochastic Gradient Descent with momentum
• SGD is noisy and requires more iterations to
reach the minimum. Adding a momentum term to
regular SGD gives faster convergence of the loss
function.
• SGD oscillates between either direction of the
gradient and updates the weights accordingly.
Adding a fraction of the previous update
to the current update makes the process
faster.
• Updated weight: Wₜ₊₁ = Wₜ − Vₜ, with the velocity
Vₜ = β Vₜ₋₁ + η ∂J(W)/∂W
denoting the accumulated change in the gradient used to reach the global minimum.
• The learning rate should be decreased when a
momentum term is used.
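A minimal NumPy sketch of this momentum update on a hypothetical quadratic loss (the loss, β and η values are illustrative assumptions):

import numpy as np

def grad(w):                        # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9                # learning rate and momentum coefficient

for t in range(100):
    v = beta * v + eta * grad(w)    # V_t = beta * V_{t-1} + eta * dJ/dW
    w = w - v                       # W_{t+1} = W_t - V_t

print(w)                            # approaches the minimum at w = 3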

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mini-batch Gradient Descent

Algorithm for mini-batch gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick a batch of B data points
4.   Compute the gradient ∂J(W)/∂W = (1/B) Σₖ₌₁ᴮ ∂Jₖ(W)/∂W        ← fast to compute and a much better estimate of the true gradient!
5.   Update the weights W ← W − η ∂J(W)/∂W
6. Return the weights
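A hedged NumPy sketch of this loop for a simple linear model y ≈ w·x (the synthetic data, batch size and learning rate are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.1, size=1000)     # hypothetical data with true w = 2

w, eta, B = 0.0, 0.1, 32
for step in range(200):
    idx = rng.integers(0, len(x), size=B)          # 3. pick a batch of B data points
    xb, yb = x[idx], y[idx]
    grad = np.mean(2.0 * (w * xb - yb) * xb)       # 4. (1/B) * sum of per-example gradients of (w*x - y)^2
    w -= eta * grad                                # 5. step in the opposite direction of the gradient

print(w)                                           # close to the true value 2.0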

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mini-batches while training
• Mini-batch gradient descent is a variation of the gradient
descent algorithm that splits the training dataset into small
batches that are used to calculate the model error and update
the model coefficients.
• Mini-batch gradient descent seeks to find a balance between
the robustness of stochastic gradient descent and the efficiency
of batch gradient descent.
• More accurate estimation of the gradient
• Smoother convergence
• Allows for larger learning rates

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mini-batches while training
Summary points
• More accurate estimation of the gradient: this lets us converge towards the target much more quickly,
and the gradients are more accurate in practice.
• Smoother convergence: if the gradient estimate is quite noisy, we cannot fully trust each step direction;
with a larger batch and more data to estimate the gradient, we can trust the learning rate more and step
more aggressively in that direction.
• Allows for larger learning rates.
• Mini-batches lead to fast training: mini-batch gradient descent splits the training dataset into small
batches used to calculate the model error and update the model coefficients, balancing the robustness of
stochastic gradient descent against the efficiency of batch gradient descent.
• Increased computation speed on GPUs: the computation can be massively parallelized by splitting
batches across multiple GPUs, or even multiple machines, to achieve more significant speed-ups in the
training process.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Summary
• Batch Gradient Descent (BGD):
Uses the entire dataset at every step, making it slow for large datasets.
However, it is computationally efficient, since it produces a stable error gradient and a
stable convergence.
• Stochastic Gradient Descent (SGD):
The other extreme: it uses a single example (a batch of 1) at each
learning step. Much faster, but it may return noisy gradients, which can cause the error rate to
jump around.
• Mini-Batch Gradient Descent:
Computes the gradients on small random sets of instances called mini-batches.
It reduces the noise of SGD while remaining more efficient than BGD.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Backpropagation Algorithm
• The algorithm is used to efficiently
train a neural network by means of the
chain rule.
• After each forward pass through the
network, backpropagation
performs a backward pass while
adjusting the model's parameters
(weights and biases).

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Backpropagation aims to
minimize the cost function by
adjusting the network's weights and
biases.
• The level of adjustment is
determined by the gradients of the
cost function with respect to those
parameters.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• The gradient of a function C(x₁, x₂, …, xₘ) at a point x is a
vector of the partial derivatives of C at x:
∇C(x) = (∂C/∂x₁, ∂C/∂x₂, …, ∂C/∂xₘ)
• The derivative of C measures the sensitivity of the function value
(output value) to a change in its argument x (input value); in other
words, the derivative tells us the direction in which C is going.
• The gradient shows how much the parameter x needs
to change (in the positive or negative direction) to
minimize C.
• Computing those gradients is done using a technique
called the chain rule.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

How does a small change in one weight (e.g., w₂) affect the final loss J(W)?

• This is a simple network with one input layer, one hidden layer (one hidden neuron) and one output layer:
the simplest neural network you can create.
• We want to compute the gradient of the loss with respect to w₂ (the weight between the hidden state and
the output), since a change in w₂ can change the loss value considerably.
• This derivative tells us how much a small change in this weight will affect our loss: if we make
a small change in the weight in one direction, will it increase or decrease our loss, and by how much?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Gradient of the loss with respect to w₂:  ∂J(W)/∂w₂

Let's use the chain rule!

To compute this derivative, we apply the chain rule backwards from the loss function through the output.
That is the gradient we care about: the gradient of our loss with respect to w₂.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Split this into the gradient of the loss with respect to the output:

∂J(W)/∂w₂ = ∂J(W)/∂ŝ · ∂ŝ/∂w₂

• We decompose this derivative into two components using the chain rule from elementary calculus.
• We split it into the gradient of the loss with respect to the output ŝ, multiplied by the gradient of the
output ŝ with respect to w₂.
• This is just a standard application of the chain rule to the original derivative on the left-hand side.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Now repeat this process for a different weight, say w₁:

∂J(W)/∂w₁ = ∂J(W)/∂ŝ · ∂ŝ/∂w₁

Replace w₂ with w₁ and apply the chain rule; the same equation still holds. But we
now notice that the gradient of the output ŝ with respect to w₁ is not directly
computable, so we apply the chain rule again to evaluate it.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Apply the chain rule once more and split with respect to z₁:

∂J(W)/∂w₁ = ∂J(W)/∂ŝ · ∂ŝ/∂z₁ · ∂z₁/∂w₁

• In this way backpropagation carries the gradients from the output all the way back to the input, allowing
the error to propagate from the output layer to the input layer and these gradients to be computed in practice.
• Many popular deep learning frameworks perform automatic differentiation, which does all of this
backpropagation for us.
• The last factor of the previous chain rule could not be evaluated directly, so we recursively applied the
chain rule one more time; with this expansion, all of the components can be evaluated.
• We can propagate these gradients through the hidden units of the neural network all the way back to the
weight we are interested in. In this example we first computed the derivative with respect to w₂, then
back-propagated and reused that information for w₁. That is why it is called backpropagation: the process
runs from the output all the way back to the input.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Repeat this for every weight in the network, using gradients from later layers:

∂J(W)/∂w₁ = ∂J(W)/∂ŝ · ∂ŝ/∂z₁ · ∂z₁/∂w₁

This process is repeated many times over the course of training: the gradients are back-propagated
through the network, from the output all the way to the inputs, to determine for every single weight how a
small change in that weight affects the loss function (whether it increases or decreases it), and that
information is used to improve the loss, which is our ultimate goal. This is the backpropagation algorithm,
the core of training neural networks.
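To make the chain rule concrete, a small NumPy sketch for the single-hidden-unit network x₁ → z₁ → ŝ → J(W) drawn above; the sigmoid activation and squared-error loss are assumptions made for illustration, not something stated on the slide:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, target = 0.5, 1.0            # hypothetical input and target value
w1, w2 = 0.3, -0.2              # hypothetical weights

# Forward pass (no biases, for simplicity)
z1 = sigmoid(w1 * x)            # hidden unit
s_hat = w2 * z1                 # network output
J = 0.5 * (s_hat - target) ** 2 # squared-error loss

# Backward pass: the chain rule from the slides
dJ_ds = s_hat - target                      # dJ/ds_hat
dJ_dw2 = dJ_ds * z1                         # dJ/dw2 = dJ/ds_hat * ds_hat/dw2
dJ_dw1 = dJ_ds * w2 * z1 * (1 - z1) * x     # dJ/dw1 = dJ/ds_hat * ds_hat/dz1 * dz1/dw1

# Sanity check against a finite-difference approximation
eps = 1e-6
J_plus = 0.5 * (w2 * sigmoid((w1 + eps) * x) - target) ** 2
print(dJ_dw1, (J_plus - J) / eps)           # the two numbers should nearly match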

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
• Batch Normalization
Normalization vs. Standardization

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why Normalization

Covariate shift: a change in the distribution of the inputs to the network (or to its internal layers); a key motivation for normalization.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Batch Normalization
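The batch-normalization slides that follow are figures. As a minimal, hedged Keras sketch, a BatchNormalization layer placed between a dense layer and its activation (the layer sizes and input shape are illustrative choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),   # normalizes activations over the batch, then applies learned scale (gamma) and shift (beta)
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()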

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Deep Neural Networks (alternate explanation: Bias-Variance Trade-off)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Bias
Bias: The difference between the prediction of the values by the Machine Learning model and the
correct value.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias

High bias → large error on the training data as well as the testing data:
• The hypothesis is too simple or linear in nature.
• The predicted values form a straight line that does not fit the data in the data set accurately.
High bias in the model → underfitting.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Variance
Variance: the variability of the model prediction for a given data point, which tells us the spread of the
data.

High variance → a very complex fit to the training data:
• The model is not able to fit accurately on data it has not seen before (test data).
• Such models perform very well on training data but have high error rates on test data.
High variance in the model → overfitting.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Bias Variance Tradeoff
Bias and variance typically trade off in relation to model complexity

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias Variance Tradeoff
• If the algorithm is too simple (a hypothesis with a linear equation), we get a high-bias and
low-variance condition.
• If the algorithm fits too complex a model (a hypothesis with a high-degree equation), we get
high variance and low bias.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias Variance Tradeoff
An algorithm cannot be both more complex and less complex at the same time.

To optimize the total error of the model, we use the bias-variance tradeoff:
• The best fit is given by the hypothesis at the tradeoff point.
• This is the best point to choose for training the algorithm, giving low error on the
training data as well as the testing data.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias Variance Tradeoff
1. High bias and high variance (the worst-case scenario)
2. Low bias and low variance (the best-case scenario)
3. Low bias and high variance (overfitting)
4. High bias and low variance (underfitting)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Exponential Moving Average
Task: approximating a given parameter that changes in time, where
we know all of its previous values. The objective is to predict
the next value, which depends on the previous values.

One possible strategy: take the average of the last several values.
This might work in certain cases, but it is not very suitable for scenarios
where the parameter depends more strongly on the most recent values.

A second possible strategy: give higher weights to more recent
values and lower weights to older values.

Exponential Moving Average

It is based on the assumption that more recent values of a variable contribute more to the formation
of the next value than older values.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average

• vₜ is a time series that approximates the given variable; its index t corresponds to the timestamp t.
• The value v₀ for the initial timestamp t = 0 is usually taken as 0.
• θ is the observation at the current iteration.
• β is a hyperparameter between 0 and 1 which defines how the weight should be distributed between
the previous average value vₜ₋₁ and the current observation θ.
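Written out (a reconstruction, since the slide shows the formula only as a figure), the recurrence these bullets describe is, in LaTeX notation:

v_t = \beta \, v_{t-1} + (1 - \beta)\, \theta_t , \qquad v_0 = 0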

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average

Exponential moving average for the t-th timestamp (expanded form):
• The most recent observation θ has a weight of 1, the second-to-last observation β, the third-to-last β²,
and so on.
• Since 0 < β < 1, the multiplication term βᵏ goes down exponentially as k increases, so the older the
observations, the less important they are.
• Finally, every term of the sum is multiplied by (1 − β).
• In practice, the value of β is usually chosen close to 0.9.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average
Mathematical interpretation
• Using the well-known limit (1 − x)^(1/x) → 1/e as x → 0 (obtained with the substitution β = 1 − x), for a
chosen value of β we can compute the approximate number of timestamps t it takes for the weight term βᵗ
to decay to 1/e ≈ 0.368.
• Taking β = 0.9 indicates that after approximately t = 10 iterations the weight has decayed to 1/e,
compared to the weight of the current observation.
• In other words, the exponentially weighted average mostly depends only on the last t ≈ 1/(1 − β) = 10
observations.
• In the equation for the exponential moving average, every observation value is multiplied by a term βᵏ;
comparing the two forms gives this estimate.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average
Bias correction
• A common problem with the exponentially weighted average is that in most problems it cannot
approximate the first few values of the series well.
• This occurs because there is not a sufficient amount of data in the first iterations.
• Case 1: v₀ = 0. The first several values put a large weight on v₀, which is 0, whereas most of the points
on the scatterplot lie above 20, so the approximation is imprecise.
• Case 2: v₀ = the value of the first observation θ₁. Although this approach works well in some situations,
it is still not perfect, especially when the given sequence is volatile (for example, if θ₂ differs too much
from θ₁); it also results in a poor approximation for volatile data.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average
Bias correction
• The solution is to use a technique called "bias correction".
• Instead of simply using the computed values vₖ, they are divided by (1 − βᵏ). Assuming that β is chosen
close to 0.9-1, this expression tends to be close to 0 for the first iterations, where k is small.
• Thus, instead of slowly accumulating the first several values when v₀ = 0, they are now divided by a
relatively small number, scaling them up to larger values.
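A small NumPy sketch of the correction described above (the data series and β are hypothetical):

import numpy as np

theta = np.array([20.0, 22.0, 21.0, 23.0, 24.0])   # hypothetical observations
beta = 0.9

v = 0.0
for k, obs in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * obs                # plain exponential moving average
    v_corrected = v / (1 - beta ** k)              # bias correction: divide by (1 - beta^k)
    print(k, round(v, 3), round(v_corrected, 3))   # the corrected values track the data from the first step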
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Exponential Moving Average
Bias correction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent : Representation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient descent
Gradient descent is the simplest optimization
algorithm; it computes the gradients of the loss function
with respect to the model weights and updates them with the gradient descent equation

wₜ = wₜ₋₁ − α · dw

where w is the weight vector,
dw is the gradient of w,
α is the learning rate, and
t is the iteration number.

(Figure: an optimization problem with gradient descent in a ravine area.
Blue: starting point. Black: local minimum area, where the surface is much more
steep in one dimension than in another. Courtesy: towardsdatascience)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Gradient descent

• In this example, the starting point and the local minimum have different horizontal coordinates and almost equal vertical
coordinates.
• Using gradient descent to find the local minimum will likely make the loss function slowly oscillate along the vertical axis.
• These bounces occur because gradient descent does not store any history about its previous gradients, making the
gradient steps less deterministic on each iteration.
• Thus, a large learning rate can lead to divergence.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need better optimization algorithms?
• In practice, the gradient descent
technique can run into certain problems
during training that can slow down the
learning process or, in the worst case,
even prevent the optimal weights from
being found.
• These problems are, on the one hand,
so-called saddle points and, on the
other hand, local minima of the loss
function. At saddle points and
local minima the loss function becomes
flat and the gradient at these points goes
towards zero. (Figure: a local minimum and a saddle point.)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent

• A gradient close to zero at a saddle
point or at a local minimum does
not improve the weight parameters
and stalls the whole learning
process.
• Gradient descent can also result in a zig-zag motion towards
the optimal weights, which can slow
down learning a lot.

Gradient Descent
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Momentum
It would be desirable for the optimization to take larger
steps in the horizontal direction and smaller steps in the
vertical direction.
Momentum uses a pair of equations at each iteration:
• an exponentially moving average of the gradient values dw:   vₜ = β vₜ₋₁ + (1 − β) dw
• the normal gradient descent update, using the computed
moving-average value at the current iteration:   wₜ = wₜ₋₁ − α vₜ
The momentum term increases for dimensions
whose gradients point in the same directions
and reduces updates for dimensions whose
gradients change directions. As a result, we
gain faster convergence and reduced oscillation.
(An overview of gradient descent optimization algorithms, Sebastian Ruder)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Momentum
Instead of simply using the gradients to update the weights, we take several
past values and perform the update in the averaged direction.

Momentum usually converges
much faster than gradient
descent. With Momentum,
there is also less risk in
using larger learning rates,
thus accelerating the training
process.

(Figure: optimization with Momentum.)

In Momentum, it is recommended to choose β close to 0.9.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Momentum
The momentum technique is an
approach that provides an
update rule motivated by
the physical perspective of
optimization. Imagine a ball in
hilly terrain trying to reach the
deepest valley. When the slope of
the hill is very steep, the ball gains a
lot of momentum and is able to
pass through slight hills in its way.
As the slope decreases, the
momentum and speed of the ball
decrease, and it eventually comes to
rest in the deepest position of a valley.

(Figure: Momentum (magenta) vs. gradient descent (cyan) on a surface with a
global minimum (the left well) and a local minimum (the right well).)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Momentum
• In general, the velocity can be seen to increase
with time. By using the momentum term,
saddle points and local minima become less
dangerous for the gradient. This is because the
step size toward the global minimum now
depends not only on the slope of the loss
function at the current point, but also on the
velocity that has built up over time.
• The advantage of momentum is that it
makes a very small change to SGD but
provides a big boost to the speed of learning.
We need to store the velocity for all the
parameters and use this velocity for
making the updates.

(Figure: SGD (black) vs. SGD with momentum (blue).)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Nesterov Accelerated Gradient
Momentum can be a good method, but if the momentum is too high the
algorithm may miss the local minimum and continue to move uphill. To resolve
this issue, the NAG algorithm was developed. It is a look-ahead method.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Nesterov Accelerated Gradient
Nesterov Accelerated Gradient is a momentum-based SGD optimizer that
"looks ahead" to where the parameters will be in order to calculate the gradient ex post
rather than ex ante.

(Figure: the projected gradient; V is initialised to 0.)

As in SGD with momentum, β is usually set to 0.9.

The projected gradient value is obtained by going 'one step ahead' using the previous velocity. This
means that at time step t we need to carry out another forward propagation before executing the
backpropagation.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Nesterov Accelerated Gradient
Steps:
1. Update the current weight wₜ to a projected weight w* using the previous velocity.
2. Carry out forward propagation, but using this projected weight.
3. Obtain the projected gradient ∂L/∂w*.
4. Compute Vₜ and wₜ₊₁ accordingly.
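A hedged NumPy sketch of these four steps for a single weight (the quadratic loss and the hyperparameter values are illustrative assumptions):

import numpy as np

def grad(w):                          # gradient of a hypothetical loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9

for t in range(100):
    w_proj = w - beta * v             # 1. projected ("look-ahead") weight using the previous velocity
    g_proj = grad(w_proj)             # 2-3. forward/backward pass at the projected weight -> projected gradient
    v = beta * v + eta * g_proj       # 4. update the velocity ...
    w = w - v                         #    ... and then the weight

print(w)                              # approaches the minimum at w = 3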

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Nesterov Accelerated Gradient

The intuition is that the standard momentum method first computes the
gradient at the current location and then takes a big jump in the direction of the
updated accumulated gradient. In contrast Nesterov momentum first makes a
big jump in the direction of the previous accumulated gradient and then
measures the gradient where it ends up and makes a correction. The idea being
that it is better to correct a mistake after you have made it.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)
(to adapt the learning rate to the computed gradient values)
• There may occur situations
where, during training, one component of the weight
vector has very large gradient
values while another one has
extremely small ones.
• This happens especially
when an infrequent
model parameter appears to
have a low influence on
predictions.
• The same problem can occur
with sparse data, where there
is too little information about
certain features.

AdaGrad accumulates the element-wise squares dw² of the gradients from all previous iterations:
vₜ = vₜ₋₁ + dw²
During the weight update, instead of using the normal learning rate α, AdaGrad scales it by dividing α by
the square root of the accumulated gradients √vₜ:
w ← w − (α / (√vₜ + ε)) · dw
where a small positive term ε is added to the denominator to prevent potential division by zero.
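A minimal NumPy sketch of this update rule (the quadratic loss and the value of α are assumptions for illustration):

import numpy as np

def grad(w):                                   # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
alpha, eps = 0.5, 1e-8

for t in range(500):
    g = grad(w)
    v += g ** 2                                # accumulate element-wise squared gradients
    w -= alpha / (np.sqrt(v) + eps) * g        # scale the learning rate by 1 / sqrt(accumulated squares)

print(w)                                       # approaches 3; note how the effective step size decays as v grows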

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)
Advantage:
The greatest advantage of AdaGrad is that
there is no longer a need to manually adjust
the learning rate as it adapts itself during
training.

• AdaGrad deals with the aforementioned


problem by independently adapting the learning
rate for each weight component.
• If gradients corresponding to a certain weight
vector component are large, then the respective
learning rate will be small.
• Inversely, for smaller gradients, the learning rate
will be bigger. This way, Adagrad deals with
vanishing and exploding gradient problems.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)
Disadvantage:
• The learning rate constantly
decays with the increase of
iterations (the learning rate is
always divided by a positive
cumulative number).
Therefore, the algorithm
tends to converge slowly
during the last iterations
where it becomes very low.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)

AdaGrad (white) vs. gradient descent (cyan) on a terrain with a saddle point. The learning rate of AdaGrad is set to be
higher than that of gradient descent, but the point that AdaGrad's path is straighter stays largely true regardless of
learning rate. This property allows AdaGrad (and other similar
Amity Centre for Artificial Intelligence, Amity University, Noida, India
AdaGrad (Adaptive Gradient Algorithm)
From the animation, it can
be seen that Adagrad
might converge slower
compared to other
methods. This could be
because the accumulated
gradient in the
denominator causes the
learning rate to shrink and
become very small,
thereby slowing down the
learning over time.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Issue with a squared gradient for vₜ :
• Transformation equations when using only the last squared gradient at every iteration, i.e., vₜ = dw²:
• If dw > 0, then the weight w is decreased by α.
• If dw < 0, then the weight w is increased by α.
• Thus, if vₜ = dw², the model weights can only be changed by ±α.
• Though this approach works sometimes, it is still not flexible: the algorithm becomes
extremely sensitive to the choice of α, and the absolute magnitudes of the gradients are ignored,
which can make the method tremendously slow to converge.
• One small positive aspect of this algorithm is that only a single bit is required to
store the signs of the gradients, which can be handy in distributed computations with strict
memory requirements.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RMSProp (Root Mean Square Propagation)
RMSProp was developed as an improvement over AdaGrad that tackles the
issue of learning rate decay, using an exponentially moving average.
• Instead of storing a cumulative sum of squared
gradients dw² in vₜ, the exponentially moving average is
calculated for the squared gradients dw².
• Experiments show that RMSProp generally converges faster
than AdaGrad because, with the exponentially moving
average, it puts more emphasis on recent gradient values
rather than equally distributing importance between all
gradients by simply accumulating them from the first iteration.
• Furthermore, compared to AdaGrad, the learning rate in
RMSProp does not always decay with the increase of iterations,
making it possible to adapt better in particular situations.
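A minimal NumPy sketch of the RMSProp update (the quadratic loss and hyperparameter values are assumptions for illustration):

import numpy as np

def grad(w):                                      # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
alpha, beta, eps = 0.1, 0.9, 1e-8

for t in range(200):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2            # exponentially moving average of squared gradients
    w -= alpha / (np.sqrt(v) + eps) * g           # per-parameter scaled step

print(w)                                          # settles close to w = 3 (small oscillations around the minimum are expected)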

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RMSProp (Root Mean Square Propagation)

In RMSProp, it is recommended to choose β close to 1.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RMSProp (Root Mean Square Propagation)

RMSProp (green) vs AdaGrad (white). The first run just shows the balls; the second run also shows the
sum of gradient squared represented by the squares.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Adam (Adaptive Moment Estimation)
• Adam is the most popular optimization algorithm in deep learning.
• Adam combines the Momentum and RMSProp algorithms: it keeps
track of exponentially moving averages of the computed gradients (the first moment) and of the squared
gradients (the second moment).
• Furthermore, it is possible to use bias correction on the moving averages for a more
precise approximation of the gradient trend during the first several iterations.
• Experiments show that Adam adapts well to almost any type of neural network
architecture, taking the advantages of both Momentum and RMSProp.

(Figure: the first moment, the second moment, and the updated weight.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Adam (Adaptive Moment Estimation)

According to the Adam paper (https://arxiv.org/pdf/1412.6980.pdf), good default values for


hyperparameters are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Role of first moment and second moment play in adaptively
adjusting the learning rate
First Moment:
• The exponentially decaying average of past gradients for each parameter (a running mean of the
gradients).
•Imagine it as a "moving average" of how steeply the loss function changes in the
direction of each parameter.
•This helps to track the overall trend of the gradient, preventing Adam from being
overly affected by sudden spikes or fluctuations.
•Its contribution is to provide a smoother and more stable direction for updating
the weights compared to using just the current gradient.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Role of first moment and second moment play in adaptively
adjusting the learning rate
Second Moment:
•Also known as the RMSprop squared gradient , it represents the exponentially decaying
average of squared past gradients for each parameter.
•Think of it as a measure of how "jumpy" or volatile the recent changes in the
gradient have been for each parameter.
•If the second moment is high, it indicates significant fluctuations, and Adam reduces the
learning rate for that parameter, preventing it from overshooting the minimum loss.
•Conversely, a low second moment suggests consistent improvement, and Adam allows a
faster learning rate for that parameter.
•The contribution of the second moment is to dynamically adjust the learning rate for
each parameter, preventing overshooting and allowing faster convergence in areas with
smoother changes.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Steps Involved in the Adam Optimization Algorithm
1. Initialize the first and second moments’ moving averages (v and s) to zero.
2. Compute the gradient of the loss function with respect to the model parameters.
3. Update the moving averages using exponentially decaying averages. This involves
calculating vt and st as weighted averages of the previous moments and the
current gradient.
4. Apply bias correction to the moving averages, particularly during the early
iterations.
5. Calculate the parameter update by dividing the bias-corrected first moment by the
square root of the bias-corrected second moment, with an added small constant
(epsilon) for numerical stability.
6. Update the model parameters using the calculated updates.
7. Repeat steps 2-6 for a specified number of iterations or until convergence.
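A compact NumPy sketch of these steps for a single parameter (the quadratic loss is an assumption for illustration; the hyperparameter defaults follow the Adam paper):

import numpy as np

def grad(w):                                     # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0
v = s = 0.0                                      # 1. first and second moment moving averages start at zero
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)                                  # 2. gradient of the loss w.r.t. the parameter
    v = beta1 * v + (1 - beta1) * g              # 3. moving average of gradients (first moment)
    s = beta2 * s + (1 - beta2) * g ** 2         #    moving average of squared gradients (second moment)
    v_hat = v / (1 - beta1 ** t)                 # 4. bias correction
    s_hat = s / (1 - beta2 ** t)
    w -= alpha * v_hat / (np.sqrt(s_hat) + eps)  # 5-6. parameter update

print(w)                                         # 7. after enough iterations, w is close to 3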

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Advantage
Adam tends to focus on faster computation time, whereas algorithms like stochastic
gradient descent focus on the data points. That is why algorithms like SGD
generalize the data in a better manner, at the cost of lower computation speed.

Disadvantage
It focuses on computation time rather than on the data points.

Note: the optimization algorithm can therefore be picked according to the
requirements and the type of data.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Visualizations of various optimization algorithms.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-
optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Summary- Optimizers

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 1
Loss Function

Amity Centre for Artificial Intelligence, Amity University, Noida, India


“Visualizing the loss
landscape of neural
nets”. Dec 2017.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

W ← W − η ∂J(W)/∂W

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

W ← W − η ∂J(W)/∂W

• η (eta) is the learning rate for training the network.
• It has a high impact on the performance of the model.
• How can we set the learning rate?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Setting the Learning Rate
• Setting a smaller learning rate means not trusting the gradient.
• A small learning rate converges slowly and gets stuck in false local minima.

(Figure: loss J(W) versus weight W, starting from an initial guess.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate
• Large learning rates overshoot, become unstable and diverge, which is even more undesirable.

(Figure: loss J(W) versus weight W, starting from an initial guess.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate
• Setting the learning rate is very challenging.
• Stable learning rates converge smoothly and avoid local minima.

(Figure: loss J(W) versus weight W, starting from an initial guess.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate

Amity Centre for Artificial Intelligence, Amity University, Noida, India


How to deal with setting learning rate?

Idea 1:
Trial and error: try different learning rates and see which works best.

Idea 2:
Do something smarter!
Design an adaptive learning rate that "adapts" to the landscape.

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Adaptive Learning Rates

• Learning rates are no longer fixed


• Can be made larger or smaller depending on:
• how large gradient is
• how fast learning is happening
• size of particular weights
• etc...

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Summary
• Loss function: compares the target and predicted output values to measure
how well the neural network models the training data.
• Types of loss function:
  • Regression loss
  • Classification loss
• Learning rate: a hyperparameter used to govern the pace at which an
algorithm updates or learns the values of a parameter estimate.
• An adaptive learning rate is a better solution than a fixed learning rate.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adaptive Learning Rates
Algorithm (TensorFlow implementation):
• Adam (tf.keras.optimizers.Adam)
• Adadelta (tf.keras.optimizers.Adadelta)
• Adagrad (tf.keras.optimizers.Adagrad)
• RMSProp (tf.keras.optimizers.RMSprop)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adagrad (Adaptive Gradient Descent)
• Here the change in the learning rate depends on how much the parameters change
during training: the more the parameters change, the smaller the
learning rate becomes. The formula to update the weights is

wₜ₊₁ = wₜ − (η / √(αₜ + ε)) · ∂J(ω)/∂ω

where η is a constant (the base learning rate), ε is a small positive value to avoid division by zero, and
the accumulated squared gradients αₜ give a different effective learning rate at each iteration.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adagrad (Adaptive Gradient Descent)

• Advantage: It abolishes the need to modify the learning rate manually. it


reaches convergence at a higher speed.

• Disadvantage: It decreases the learning rate aggressively and monotonically.


There might be a point when the learning rate becomes extremely small,
because the squared gradients in denominator keep accumulating, and thus
the denominator increasing. Due to small learning rates, the model
eventually becomes unable to acquire more knowledge, thus, accuracy of
the model is compromised.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RMSprop (Root mean square propogation)
• It uses the sign of the gradient, adapting the step size individually
for each weight.
• Two successive gradients are first compared for sign. If they have the same sign, we are going in the
right direction, so the step size is increased by a small fraction; for opposite signs, the step size is
decreased.
• The algorithm keeps a moving average of the squared gradients for every
weight and divides the gradient by the square root of this mean square:

Wₜ₊₁ = Wₜ − (η / √v_{w,t}) · ∂J(ω)/∂ω
v_{w,t+1} = γ · v_{w,t} + (1 − γ) · (∂J(ω)/∂ω)²

where γ is the momentum (forgetting) factor, usually 0.9.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RMSprop (Root mean square propogation)

• Advantage: it reduces the monotonic decrease in the learning rate seen in AdaGrad.

• Disadvantage: it does not work well on the whole of a large dataset at once, but rather
with mini-batches of data.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adadelta

• AdaDelta is a stochastic optimization technique that allows for


per-dimension learning rate method for SGD.

• It is an extension of Adagrad that seeks to reduce its aggressive,


monotonically decreasing learning rate.

• Instead of accumulating all past squared gradients, Adadelta


restricts the window of accumulated past gradients to a fixed size
w.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Adadelta

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Adam (Adaptive moment estimation)


• The Adam optimizer updates the learning rate for each network weight individually.
• The first moment is the mean, and the second moment is the uncentered variance (meaning we
do not subtract the mean during the variance calculation):

mₜ = β₁ mₜ₋₁ + (1 − β₁) ∂J(ω)/∂ω
vₜ = β₂ vₜ₋₁ + (1 − β₂) (∂J(ω)/∂ω)²

Bias-corrected estimators are then computed for the first and second moments.
• Since mₜ and vₜ are initialized to 0, they tend to be 'biased towards 0', as both β₁ and β₂ ≈ 1. Adam
fixes this problem by computing bias-corrected mₜ and vₜ. This also controls the weights when
approaching the global minimum, preventing high oscillations near it.
• The algorithm has a faster running time, low memory requirements, and requires less tuning
than other optimization algorithms.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Visualizations of various optimization algorithms.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-
optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 3
Overfitting and Underfitting: Bias-Variance Trade-off

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• The model is too simplistic and not
able to learn enough from the training
data

• Hence it reduces the accuracy and


produces unreliable predictions.

• How to avoid Underfitting?


• By increasing the training time of The model is unable to capture the data points
present in the plot.
the model.
• By increasing the number of Source:- https://www.javatpoint.com/overfitting-
features. and-underfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• The model is too simplistic and not able
to learn enough from the training data

• Hence it reduces the accuracy and


produces unreliable predictions.

• Reason for Underfitting?


• Data used for training is not cleaned
and contains noise (garbage values)
in it The model is unable to capture the data points
• The model has a high bias present in the plot.
• The size of the training dataset used
is not enough Source:- https://www.javatpoint.com/overfitting-
• The model is too simple and-underfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• When learning a model we have a set of data (training set)
that we use to learn the model parameters
• The evaluation of the model needs to happen out-of-sample,
i.e., on a different set that was not used for learning model
parameters
• One of the most common problems during training is tying
the model to the training set
– Overfitting

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• When a model is over fitted it is not expected to perform well
to new data
– It is not generalizable

• Overfitting occurs when the model chosen is too complex that


ends up describing the noise in the data instead of the trend
– E.g., too many parameters relative to the size of the training dataset
– An over fitted model memorizes the training instances and does not
learn the general trend in them

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
•Data used for training is not cleaned and contains noise
(garbage values) in it

•The model has a high variance

•The size of the training dataset used is not enough

•The model is too complex

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Bias of a Model: Underlying assumptions to make learning possible.
Simpler model=>More assumption=> High Bias

• Variance of a Model: Variability of model for given data points, Model


with high variance pays a lot of attention to training data, may end up
memorizing data rather than learning from it

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
• If we want to minimize MSE, we need to minimize both bias and variance
• However, when bias gets smaller, variance increases and vice versa
• A model that is underfitted has high bias
– Misses relevant relations between the independent variables and the
response variable
– Bias is reduced by increasing model complexity
• A model that is overfitted has high variance
• The model captures the noise in the training data instead of the trend
• Variance is reduced by decreasing model complexity

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Trading off goodness of fit against complexity of the model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• The real aim of supervised learning is to do well on test data that is not known
during learning
• Choosing the values for the parameters that minimize the loss function on the
training data is not necessarily the best policy
• Generalization refers to how well the model trained on the training data predicts the correct output for new instances
• We want the learning machine to model the true regularities in the data
and to ignore the noise in the data.
• But the learning machine does not know which regularities are real and
which are accidental quirks of the particular set of training examples we
happen to pick
• So how can we be sure that the machine will generalize correctly to new
data?
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Model Selection: Which model is best?

Source: https://www.javatpoint.com/overfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
•Techniques to Avoid Overfitting
•Data Augmentation
•Regularization
•Drop-out
•Early-stopping
•Cross validation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Simple model (low complexity, low capacity):
• Has fewer parameters to be learned
• May underfit: it may not capture the underlying trend of the data
• Higher error for training data, and may also give high error for validation data
• High bias, low variance

Complex model (high complexity, high capacity):
• Has more parameters to be learned
• May overfit: it may start learning from noise and inaccurate data entries
• Lower error for training data, but may give higher error for validation data
• Low bias, high variance

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 4
How to avoid overfitting

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problem of overfitting

Source: https://www.javatpoint.com/overfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
•Techniques to Avoid Overfitting
•Data Augmentation
•Regularization
•Drop-out
•Early-stopping
•Cross validation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Train with more data to avoid overfitting; regularize the model
• Capturing and labeling data is usually expensive
• New data is generated from existing data (see the sketch after this slide) with the help of:
• Image rotations
• Translations
• Blur, added noise
• Brightness changes
• Scaling
• Flips (up-down, left-right)
• and so on
Amity Centre for Artificial Intelligence, Amity University, Noida, India
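A minimal sketch of this kind of augmentation with tf.keras preprocessing layers; the transformation factors below are illustrative assumptions, not values from the slides:

import tensorflow as tf

# Augmentation pipeline: each layer applies a random transformation at training time
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # left-right / up-down flips
    tf.keras.layers.RandomRotation(0.1),                    # small random rotations
    tf.keras.layers.RandomTranslation(0.1, 0.1),            # random shifts in height/width
    tf.keras.layers.RandomZoom(0.1),                        # random scaling
    tf.keras.layers.RandomContrast(0.2),                    # brightness/contrast changes
    tf.keras.layers.GaussianNoise(0.01),                    # add a small amount of noise
])

# Typically placed right after the model input, so new "versions" of each
# training image are generated on the fly every epoch.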
A very deep network with many neurons can fit the training set almost perfectly (overfitting).
Slide source: Coding Lane

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Regularization constrains such a very deep network with many neurons, reducing its effective complexity.
Slide source: Coding Lane

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Amity Centre for Artificial Intelligence, Amity University, Noida, India


Types of Regularization
• Ridge (L2) Regularization
• Lasso (L1) Regularization
• Elastic Net Regularization

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Cost function = sum of the squared residuals + λ × (slope)²
For the linear regression line, let's consider two points that are on the line.
Here:
• sum of the squared residuals = the data-fit term
• λ × (slope)² = the penalty for the errors
• slope = slope of the curve/line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Cost function = sum of the squared residuals + λ × (slope)²
For the linear regression line, consider two points that are on the line:
• sum of the squared residuals = 0 (the two points lie on the line)
• λ = 1
• slope = 1.4
Then, cost function = 0 + 1 × (1.4)² = 1.96
Linear regression line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Cost function = sum of the squared residuals + λ × (slope)²
For the ridge regression line, let's assume:
• λ = 1
• slope = 0.7
Then, cost function = sum of the squared residuals + 1 × (0.7)² = 0.63
Ridge regression line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Comparing the two models on all data points, the ridge regression line (cost ≈ 0.63) fits the data more accurately than the linear regression line (cost ≈ 1.96).
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Lasso (L1) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the absolute values of the coefficients.

Cost function = sum of the squared residuals + λ × |slope|
Here:
• sum of the squared residuals = the data-fit term
• λ × |slope| = the penalty for the errors
• slope = slope of the curve/line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Lasso (L1) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the absolute values of the coefficients.

Comparing the two models on all data points, the lasso regression line (cost ≈ 0.8) fits the data more accurately than the linear regression line (cost ≈ 1.4).

Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Elastic Net Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients plus the sum of the absolute values of the coefficients.

It is the combination of Ridge and Lasso regularization.

Cost function = sum of the squared residuals + λ₁ × |slope| + λ₂ × (slope)²
Here:
• sum of the squared residuals = the data-fit term
• the λ terms = the penalty for the errors
• slope = slope of the curve/line

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge: Useful when we have many variables with relatively smaller data samples. Ridge will reduce the impact of features that are not important in predicting the output values.
Lasso: Preferred when we are fitting a linear model with fewer variables. Lasso will eliminate many features and reduce overfitting in the linear model.
Elastic Net: Preferred when we do not know whether we want shrinkage or sparsity in the parameter space. Elastic Net combines feature elimination from Lasso and feature-coefficient reduction from Ridge to improve the model's predictions.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
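A minimal sketch of how these penalties are typically attached to network weights in tf.keras; the λ values of 0.01 are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    # L2 (ridge) penalty: adds lambda * sum(w**2) to the loss for this layer's weights
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
    # L1 (lasso) penalty: adds lambda * sum(|w|) to the loss, can push weights to exactly 0
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l1(0.01)),
    # Elastic net: combination of both penalties
    layers.Dense(10, activation="softmax",
                 kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)),
])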


• During training, some number of nodes are randomly ignored or "dropped out"
• During weight updation, the layer configuration appears "new"
• Provides regularization by avoiding co-adaptation between network layers to correct mistakes from prior layers
• Improves generalization of the model
• Useful in wider networks to avoid overfitting

Amity Centre for Artificial Intelligence, Amity University, Noida, India
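A minimal tf.keras sketch of dropout between dense layers; the 0.5 drop rate and layer sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),   # randomly zeroes 50% of the activations, only during training
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
# At test time dropout is disabled automatically (Keras uses inverted dropout,
# so no extra rescaling is needed at inference).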


• Stop training before we have a chance to overfit
• Number of iterations (epochs) is a hyperparameter
• Too few epochs => suboptimal solution (underfit)
• Too many epochs => overfitting
(Plot: Loss vs. Training Iterations. The training loss keeps decreasing, while the testing loss starts rising after some point; stop training there. The region before that point corresponds to under-fitting, the region after it to over-fitting.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
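A minimal sketch of early stopping with a tf.keras callback, assuming a compiled model and a held-out validation set (the variable names and patience value are illustrative):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation (testing) loss curve
    patience=5,                 # allow 5 epochs without improvement before stopping
    restore_best_weights=True   # roll back to the weights at the best point of the curve
)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=200,             # upper bound; training usually stops earlier
                    callbacks=[early_stop])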


• When data is plentiful, set aside a part of the training data as validation data -> perform model selection
• Declare the final result on test data
• A typical ratio for splitting into training, validation, and test data is 60:20:20

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• K-fold cross-validation
• When data is not sufficient, split the data into k segments, train with (k-1) segments, validate with the remaining segment, and iterate

Amity Centre for Artificial Intelligence, Amity University, Noida, India
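A minimal k-fold cross-validation sketch using scikit-learn's KFold to generate the splits; build_model is an assumed helper that returns a freshly compiled model with an accuracy metric, and X, y are the full training arrays:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, val_idx in kf.split(X):            # X, y: the full training data
    model = build_model()                         # assumed helper: new, compiled model
    model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
    loss, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)                            # validate on the held-out segment

print("mean validation accuracy:", np.mean(scores))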


Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 5
Batch Normalization

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Normalization
• Normalization is a procedure to change the values of numeric variables in the dataset to a common scale, without distorting the differences in the ranges of values.

Batch Normalization
• Batch normalization is a technique for training very deep neural networks that normalizes the contributions to a layer for every mini-batch. This has the effect of stabilizing the learning process and drastically decreasing the number of training epochs required to train deep neural networks.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Normalization is a data pre-processing tool used to bring the numerical
data to a common scale without distorting its shape, to ensure that our
model can generalize appropriately.

• Batch normalization is a process to make neural networks faster and


more stable through adding extra layers in a deep neural network. The
new layer performs the standardizing and normalizing operations on
the input of a layer coming from a previous layer.

• The normalizing process in batch normalization takes place in batches,


not as a single input.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-processing stage. When
the input passes through the first layer, it transforms, as a sigmoid function applied over the dot product of input X
and the weight matrix W.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Similarly, this transformation will take place for the second layer and go till the last layer L as shown in the following
image.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Although our input X was normalized, over time the output will no longer be on the same scale. As the data goes through multiple layers of the neural network and L activation functions are applied, it leads to an internal covariate shift in the data.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Internal Covariate Shift is the change in the distribution of network
activations due to the change in network parameters during training

https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
Amity Centre for Artificial Intelligence, Amity University, Noida, India
If we stabilize the input values for each layer (defined as z = Wx + b, where z is the linear transformation with the weights/parameters W and the biases b), we can prevent our activation function from pushing the input values into the saturated (max/min) regions of the activation function.
Fig.: From the gradient it can be observed that for larger z the function approaches zero. When the network's nodes exist in this space, training slows down significantly, since the gradient values decrease.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Transform the data to have mean zero and standard deviation one.
• Calculate the mean and standard deviation of the hidden-layer activations (over the mini-batch, for each of the neurons at layer h).
• Normalize the hidden activations by subtracting the mean from each activation and dividing by the standard deviation, with a small smoothing term (ε) added for numerical stability.
• γ (gamma) and β (beta): these learnable parameters are used for re-scaling (γ) and shifting (β) the vector containing the values from the previous operations.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
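In the standard formulation (for a mini-batch x_1, ..., x_m at a given layer), these steps are usually written as:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

where γ and β are learned along with the network weights.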


https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Speeds up the training
By normalizing the hidden-layer activations, batch normalization speeds up the training process.

• Handles internal covariate shift
It solves the problem of internal covariate shift. Through this, we ensure that the input to every layer is distributed around the same mean and standard deviation.

• Smoothens the loss function
Batch normalization smoothens the loss function, which makes the model parameters easier to optimize and in turn improves the training speed of the model.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
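A minimal tf.keras sketch of adding batch-normalization layers between the dense layers of a network; the layer sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64),
    layers.BatchNormalization(),   # normalizes this layer's outputs per mini-batch
    layers.Activation("relu"),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])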


Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 6
Hyperparameter tuning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameters are defined as the parameters that are explicitly set by the user to control the learning process
• They are used in the process of estimating the model parameters; they are specific to the algorithm and, unlike parameters, cannot be calculated from the data
• They are selected and set before the learning algorithm begins training the model. Hence, they are external to the model, and their values cannot be changed during the training process.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


•The k in kNN or K-Nearest Neighbour algorithm
•Learning rate for training a neural network
•Number of layers
•Number of nodes per layer
•Momentum
•Train-test split ratio
•Batch Size
•Number of Epochs
•Number of clusters in Clustering Algorithm
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Model Parameter
• Used by the model for making predictions.
• Learned by the model from the data itself.
• Usually not set manually.
• Part of the model and key to a machine learning algorithm.

Model Hyperparameter
• Usually defined manually by the machine learning engineer.
• One cannot know the exact best value of a hyperparameter for a given problem; the best value can be determined by trial and error.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameter for Optimization
• Learning Rate
• Batch Size

• Hyperparameter for Specific Models


• Number of hidden units
• Number of layers
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Hyperparameter tuning consists of finding a set of optimal
hyperparameter values for a learning algorithm while applying this
optimized algorithm to any data set

• It maximizes the model’s performance, minimizing a predefined loss


function to produce better results with fewer errors.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Some important hyperparameters that require tuning in neural
networks are:
• Number of hidden layers
• Number of nodes/neurons per layer
• Learning rate
• Momentum

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameters can be tuned either manually or automatically.
• Some automated hyperparameter tuning methods include:
• Grid search,
• Random search,
• Bayesian optimization.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Grid search is a sort of "brute force" hyperparameter tuning method. A grid of possible discrete hyperparameter values is defined, and the model is fitted with every possible combination. The model performance for each set is recorded, and the combination that produced the best performance is selected.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• It chooses random values rather than
using a predefined set of values like
the grid search method.

• Tries a random combination of


hyperparameters in each iteration
and records the model performance.
After several iterations, it returns the
mix that produced the best result.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Grid Search. Random Search.

Grid and random search often evaluate many unsuitable


hyperparameter combinations.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
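A minimal manual grid-search sketch over two common hyperparameters, learning rate and batch size; build_model(lr) is an assumed helper that returns a model compiled with that learning rate and an accuracy metric, and the candidate values are illustrative:

best_acc, best_config = 0.0, None

for lr in [1e-2, 1e-3, 1e-4]:                 # grid of discrete learning rates
    for batch_size in [16, 32, 64]:           # grid of discrete batch sizes
        model = build_model(lr)               # assumed helper
        model.fit(x_train, y_train, batch_size=batch_size, epochs=10, verbose=0)
        _, acc = model.evaluate(x_val, y_val, verbose=0)   # record each combination
        if acc > best_acc:
            best_acc, best_config = acc, (lr, batch_size)

print("best combination (lr, batch size):", best_config)

Random search follows the same loop structure, but samples each hyperparameter value at random instead of walking the full grid.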
• This method treats the search for
the optimal hyperparameters as
an optimization problem.
• When choosing the next
hyperparameter combination, this
method considers the previous
evaluation results and then
applies a probabilistic function to
select the combination that will
probably yield the best results

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameters are the parameters that
are explicitly defined to control the
learning process before applying to a
learning algorithm.
• These are used to specify the learning
capacity and complexity of the model.
• Some of the hyperparameters are used for
the optimization of the models, such as
Batch size, learning rate, etc., and some are
specific to the models, such as Number of
Hidden layers, etc.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
Convolutional Neural
Networks.
Images, Text,
Sound etc.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
255 200 211 235 1 255 200 211 235 1

200 161 217 233 0 200 161 217 233 0


218 65 214 237 0 218 65 214 237 0
232 29 217 236 1 232 29 217 236 0
234 23 216 240 0 234 23 216 240 0
102 31 217 234 0 102 31 217 234 0
Computer Interpretation

• For grayscale images, the pixel value is a single number that represents the brightness of the pixel. The most common pixel format is the byte image, where this number is stored as an 8-bit integer giving a range of possible values from 0 to 255.
• Similarly, for color images, each color channel is represented by the range of decimal numbers from 0 to 255 (256 levels for each color), equivalent to the range of binary numbers from 00000000 to 11111111, or hexadecimal 00 to FF. The total number of available colors is 256 x 256 x 256, or 16,777,216 possible colors.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Virat Kohli 86%
Rohit Sharma 7%
M. S. Dhoni 5.8%
Sachin Tendulkar 1.2%

Classification
Input Image Pixel
Representation

Classification
Model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


FACE: Low-level features (edges, corners) -> Mid-level features (eye, nose, ears) -> High-level features (facial structure)

CAR: Low-level features (edges, corners) -> Mid-level features (head light, tyre) -> High-level features (vehicle shape and structure)

High-level Features, Mid-level Features, Low-level Features

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully connected neural networks (FCNNs) are a type of artificial neural network where
the architecture is such that all the nodes, or neurons, in one layer are connected to the
neurons in the next layer.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Input Image
(2D, Matrix of Pixels)
x1 x2 xn

Fully Connected Layer


(Connects neurons of input layer and
hidden layer, has multiple parameters,
no spatial information)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Connect Input Layer
Convolution Filter patches to neurons of
Filter Size = 4 ꓫ 4
Number of Weights = 16
hidden layer/subsequent
Shift or Stride = 2 layer with sliding window
approach.

Step 1:
Extract Set of Local Features by
applying filters (set of weights)
Step 2:
Apply Multiple Filters for
extraction of different features
Step 3:
Spatial Sharing of parameters
for each filter
Input Image (Array of Pixels)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolutional Neural Network (Feature
Extraction and Convolution)

Input Image
(2-D array of
pixels)
Convolutional Neural
Network X or O
Convolutional Neural
Network X
Convolutional Neural
Network O
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Challenging Cases
Rotation Weighted Translation Scaling

Convolutional Neural
Network X

Convolutional Neural
Network O

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computer and Human Interpretation

=
Human
Interpretation

=
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1
Computer -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1
Interpretation -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computer Interpretation

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 X -1 -1 -1 -1 X X -1
-1 X X -1 -1 X X -1 -1
-1 -1 X 1 -1 1 -1 -1 -1
Pixel wise
-1 -1 -1 -1 1 -1 -1 -1 -1
Matching -1 -1 -1 1 -1 1 X -1 -1
-1 -1 X X -1 -1 X X -1
-1 X X -1 -1 -1 -1 X -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

=x
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1
Decision -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Computers are Literal

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Feature matching for symbol ‘X’

=
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Piece Matching of Features

Features

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Piece Matching of Features

1 -1 1
-1 1 -1
1 -1 1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution Operation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution Operation

1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Convolution Operation
Feature Map
1 -1 -1 1 1 -1
-1 1 -1 1 1 1
-1 -1 1
-1 1 1
Filter
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 .55
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India
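A minimal NumPy sketch of the matching step illustrated above, assuming the feature-map value at each position is the average of the element-wise products between the 3x3 filter and the image patch under it (which reproduces scores such as 0.55 and 1.00 on the +1/-1 'X' image):

import numpy as np

def match_score(patch, kernel):
    # element-wise multiply, then average over the 9 positions
    return float(np.mean(patch * kernel))

def feature_map(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):                      # slide the filter over the image
        for j in range(out_w):
            out[i, j] = match_score(image[i:i+kh, j:j+kw], kernel)
    return out

kernel = np.array([[ 1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1,  1]])               # diagonal filter from the slide

# With image set to the 9x9 matrix of +1/-1 values shown above,
# feature_map(image, kernel) gives the 7x7 map of match scores.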


Convolution Operation

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
=
1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution Operation
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1

=
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1

=
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1

=
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolutional layer

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Kernel vs Filter

A kernel is the matrix that is swept (convolved) across a single channel of the input.
A filter is the collection of all kernels that are convolved over the channels of the input.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Identity Kernel

Original Image Output Image – Same as Original

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Blur

Original Image Output Image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Left Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Right Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Bottom Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Top Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Emboss

Original Image Output Image


The emboss kernel (similar to the Sobel kernel and sometimes referred to mean the same) givens the illusion of depth by
emphasizing the differences of pixels in a given direction. In this case, in a direction along a line from the top left to the bottom right.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Outline

Original Image Output Image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Sharpen

Original Image Output Image

The sharpen kernel emphasizes differences in adjacent pixel values. This makes the image look more vivid.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
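A minimal sketch of applying one of these kernels to a grayscale image using scipy.ndimage.convolve; the sharpen-kernel values below are the commonly used ones and are an assumption here, since the slide shows only the resulting image:

import numpy as np
from scipy import ndimage

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])   # emphasizes differences between adjacent pixels

# image: a 2-D NumPy array of grayscale pixel values
sharpened = ndimage.convolve(image, sharpen, mode="reflect")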


CONV: Convolutional (kernel) layer
RELU: Activation function
POOL: Dimension-reduction layer
FC: Fully connected layer

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolutional Neural Network--- Spatial View

Depth
32 Dimensions of Layer
H*W*D
Height
H (height) and W (width)
are spatial dimensions
whereas D (depth) is
Width number of filters
32
3
Stride = Step size of filter, Receptive Field = Location of connected path in an input image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Non-Linearity in
Convolutional Neural
Network
Applied after every convolutional layer. The
rectified linear activation (ReLU) function is a
simple calculation that returns the value
provided as input directly, or the value 0.0 if
the input is 0.0 or less.

g(x) = max(0, x)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Rectified Linear Units (ReLUs)
(In each row below, the left seven values are the feature-map inputs and the right seven values are the corresponding ReLU outputs: negatives become 0.)
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33 0.33 0 0.11 0 0.11 0 0.33

-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55 0 0.55 0 0.33 0 0.55 0

0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11 0.11 0 0.55 0 0.55 0 0.11

-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11 0 0.33 0 1.00 0 0.33 0

0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11 0.11 0 0.55 0 0.55 0 0.11

-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55 0 0.55 0 0.33 0 0.55 0

0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33 0.33 0 0.11 0 0.11 0 0.33

0.33 -0.11 0.55 0.33 0.11 -0.11 0.77 0.33 0 0.55 0.33 0.11 0 0.77

-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11 0 0.11 0 0.33 0 1.00 0

0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11 0.55 0 0.11 0 1.00 0 0.11

0.33 0.33 -0.33 0.55 -0.33 0.33 0.33 0.33 0.33 0 0.55 0 0.33 0.33

0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55 0.11 0 1.00 0 0.11 0 0.55

-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11 0 1.00 0 0.33 0 0.11 0

0.77 -0.11 0.11 0.33 0.55 -0.11 0.33 0.77 0 0.11 0.33 0.55 0 0.33

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Pooling
STEPS

• Dimensionality Reduction 1. Pick a window size


• Preserve Spatial Invariance (usually 2 or 3).
2. Pick a stride (usually 2).
The types of pooling operations are: 3. Walk your window across
Max pooling: The maximum pixel value your filtered images.
of the batch is selected. 4. From each window, take
Average pooling: The average value of the maximum value.
all the pixels in the batch is selected.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
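A minimal NumPy sketch of max pooling with a 2x2 window and stride 2, following the steps listed above:

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()    # keep only the largest value in each window
    return out

# Average pooling would use window.mean() instead of window.max().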


Why Pooling ?
• Subsampling pixels will not change the object
bird
bird

Subsampling

We can subsample the pixels to make the image smaller => fewer parameters to characterize the image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Pooling

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Max-Pooling

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Pooling layer
• A stack of images becomes a stack of smaller images.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride
• number of cells the filter is moved to calculate the next output
• sample only every s pixels in each direction in the output

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Stride
• Stride = 2
• First Value:

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride
• Stride = 2
• Next value: -4 (from the element-wise products of the filter and the next input patch, e.g. 8*(-1) + 0*(-1) + 5*(-1) + ...)
• The size of the output feature map may decrease
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Padding
In order to assist the kernel with
processing the image, padding is added
to the frame of the image to allow for
more space for the kernel to cover the
image. Adding padding to an image
processed by a CNN allows for more
accurate analysis of images.
• Use Conv without shrinking the height
and width
• Helpful in building deeper networks
• Keep more of the information at the
border of an image
Zero-Padding

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Same Padding

• Buffers the edge of


the input with
filter_size/2 zeros
(integer division)
• Output dimension is
the same as the input
for s=1
• Output dimension
reduces less for s>1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


These are the network
parameters to be learned.
1 -1 -1
1 0 0 0 0 1 -1 1 -1 Filter 1
0 1 0 0 1 0 -1 -1 1
0 0 1 1 0 0
1 0 0 0 1 0 -1 1 -1
-1 1 -1 Filter 2
0 1 0 0 1 0
0 0 1 0 1 0 -1 1 -1

……
6 x 6 image
Each filter detects a small pattern (3 x 3).
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
stride=1 Dot
product
1 0 0 0 0 1
0 1 0 0 1 0 3 -1
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

6 x 6 image
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
If stride=2

1 0 0 0 0 1
0 1 0 0 1 0 3 -3
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

6 x 6 image
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
stride=1

1 0 0 0 0 1 3 -1 -3 -1
0 1 0 0 1 0
0 0 1 1 0 0 -3 1 0 -3
1 0 0 0 1 0
0 1 0 0 1 0 -3 -3 0 1
0 0 1 0 1 0
3 -2 -2 -1
6 x 6 image
Amity Centre for Artificial Intelligence, Amity University, Noida, India
-1 1 -1
-1 1 -1 Filter 2
-1 1 -1
stride=1
Repeat this for each filter
1 0 0 0 0 1
3 -1 -3 -1
0 1 0 0 1 0 -1 -1 -1 -1
0 0 1 1 0 0 -3 1 0 -3
-1 -1 -2 1
1 0 0 0 1 0 Feature
0 1 0 0 1 0 -3 -3 Map
0 1 Two 3X3 Kernels
-1 -1 -2 1
0 0 1 0 1 0 Forming 4 x 4 x 2 matrix
3 -2 -2 -1
6 x 6 image -1 0 -4 3
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 0 0 0 0 1 1 -1 -1 -1 1 -1
0 1 0 0 1 0 -1 1 -1 -1 1 -1
0 0 1 1 0 0 -1 -1 1 -1 1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0 convolution
image

x1
1 0 0 0 0 1
0 1 0 0 1 0 x2
Fully- 0 0 1 1 0 0

connected 1 0 0 0 1 0




0 1 0 0 1 0
0 0 1 0 1 0
x36
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1 Filter 1 1 1
-1 1 -1 2 0
-1 -1 1 3 0
4: 0 3
1 0 0 0 0 1


0 1 0 0 1 0 7 0
0 0 1 1 0 0 8 1
1 0 0 0 1 0 9: 0
0 1 0 0 1 0 0


0 0 1 0 1 0
13 0 Only connect to
6 x 6 image 9 inputs, not
14 0
fully connected
fewer parameters! 15 1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1 -1 -1 1: 1
-1 1 -1 Filter 1 2: 0
-1 -1 1 3: 0
4: 0 3
1 0 0 0 0 1


0 1 0 0 1 0 7: 0
0 0 1 1 0 0 8: 1
1 0 0 0 1 0 9: 0 -1
0 1 0 0 1 0 10: 0


0 0 1 0 1 0
6 x 6 image 13: 0
Fewer parameters 14: 0
Shared weights
15: 1
16: 1
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Click the link below or copy paste the URL in your browser

https://poloclub.github.io/cnn-explainer/

With the CNN Explainer you can Learn and implement Convolutional Neural
Network (CNN) in your browser! With real sample image dataset

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stacking of Layers

Convolutional Layer -> Activation Function (ReLU) -> Pooling (Max-Pooling)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Multiple Stacking of Layers

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Conv. Layer -> Acti. Fun. (ReLU) -> Conv. Layer -> Acti. Fun. (ReLU) -> Pooling (Max-Pooling) -> Conv. Layer -> Acti. Fun. (ReLU)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer (Training Phase)

X
O
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer (Testing Phase)
0.9

X
0.65

0.9 0.65

0.45 0.87
0.45

0.87
0.912
0.96

0.96 0.73 0.73

0.23 0.63
0.23

O
0.63
0.44 0.89
0.44
0.94 0.53
0.89

0.94

0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Fully Connected Layer (Testing Phase)
0.9

X
0.65

0.9 0.65 0.45

0.45 0.87 0.87

0.96

0.96 0.73 0.73

0.23 0.63
0.23

O
0.63
0.44 0.89

0.94 0.53
0.44
0.517
0.89

0.94

0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer
0.9

0.65

X
0.45

0.87
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 0.96
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 0.73
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 0.23

O
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 0.63
-1 -1 -1 -1 -1 -1 -1 -1 -1
0.44

0.89

0.94
Fully Connected
Layer
0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Multiple Stacking of Fully Connected Layers

0.9

X
0.65

0.45

0.87
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 0.96

-1 -1 1 -1 -1 -1 1 -1 -1
0.73
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 0.23

O
-1 -1 -1 1 -1 1 -1 -1 -1
0.63
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
0.44
Fully Fully Fully
0.89 Connected Connected Connected
Layer 1 Layer 2 Layer 3
0.94

0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stacking of Multiple Layers
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Conv. Layer -> Acti. Fun. (ReLU) -> Conv. Layer -> Acti. Fun. (ReLU) -> Pooling (Max-Pooling) -> Conv. Layer -> Acti. Fun. (ReLU) -> Pooling (Max-Pooling) -> Fully Conn. Layer -> Fully Conn. Layer -> X

Amity Centre for Artificial Intelligence, Amity University, Noida, India
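A minimal tf.keras sketch of this Conv -> ReLU -> Pool -> FC stack for the 9x9 single-channel input used in the example; the filter counts and unit sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(9, 9, 1)),
    layers.Conv2D(4, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(4, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(8, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),    # fully connected layer 1
    layers.Dense(2, activation="softmax"),  # fully connected layer 2: P(X), P(O)
])
model.summary()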


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Convolutional Neural Network- Classification

Classification

Class A

Input Class B
Image
Class C

Class D

• Convolutional layers and pooling help to extract high-level features of the input
• The fully connected layer uses the extracted high-level features to classify the input image into different classes
• The output also includes the class probability of the image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Formulas
(Output Dimensions Calculations)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution operation
if a 𝑚𝑚 ∗ 𝑚𝑚 image convolved
with 𝑛𝑛 ∗ 𝑛𝑛 kernel,
the output image is of
size (𝑚𝑚 − 𝑛𝑛 + 1) ∗ (𝑚𝑚 − 𝑛𝑛 +
1).

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Padding

If an n x n matrix is convolved with an f x f matrix with padding p, then the size of the output image will be (n + 2p - f + 1) x (n + 2p - f + 1), where p = 1 in this case.

Padded image convolved with a 2 x 2 kernel

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride

Stride is the number of pixels the filter shifts over the input matrix.

left image: stride = 0, middle image: stride = 1, right image: stride = 2

For padding p, filter size f x f, input image size n x n and stride s, the output image dimension will be
[floor((n + 2p - f) / s) + 1] x [floor((n + 2p - f) / s) + 1],
where floor(.) (like the Math.floor() static method) always rounds down to the largest integer less than or equal to a given number.

If an image is 100 x 100, the filter is 6 x 6, the padding is 7, and the stride is 4, the result of the convolution will be (100 - 6 + (2)(7)) / 4 + 1 = 28, i.e. a 28 x 28 output.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
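A small Python helper implementing the output-size formula above, checked against the 100 x 100 example:

import math

def conv_output_size(n, f, p, s):
    """Output width/height for input size n, filter size f, padding p, stride s."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(100, 6, 7, 4))   # -> 28, i.e. a 28 x 28 output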


Deep Neural
Networks

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep Learning in 2D image (Recap)
Input Image (INPUT 28x28x3)
-> Conv1: convolution, 8 filters, 3x3, valid padding, stride = 1 (8 channels)
-> Max-Pooling, 2x2, stride = 2 (8 channels)
-> Conv2: convolution, 16 filters, 3x3, valid padding, stride = 1 (16 channels)
-> Max-Pooling, 2x2, stride = 2 (16 channels)
-> Fully-Connected Neural Network, 64 units, ReLU activation
-> Fully-Connected Neural Network, 10 units, Softmax activation
-> Output Class (e.g. "Cat")

Input Image -> Classification Model -> Output Class

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep Learning in 1D Signal
Convolutional layers Flatten layer Fully-Connected Layer

Cat

Input Signal Output Class


Classification Model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution in Time Series Signal

Inverted Kernel Convolution Operation

1 2 .5 0

Time series Signal


Padding = Same
0 1 0 0 0 0 -1 0 0 0 0

Time series Signal Representation

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Convolution in Time Series Signal
Time series Signal
0 0 1 0 0 0 0 -1 0 0 0

Inverted Kernel
Padding = Same
1 2 .5

=
Time series Signal
0 .5 2 1 0 0 -.5 -2 -1 0 0

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution in Time Series Signal
Kernel Kernel size=3

Time series Signal


X X

Padding = Same

Convolution
Inverted Kernel

Result

Amity Centre for Artificial Intelligence, Amity University, Noida, India






Activation Function

0 .5 2 1 0 0 -.5 -2 -1 0 0
Time series Signal

ReLu Activation

0 .5 2 1 0 0 0 0 0 0 0 Result

Amity Centre for Artificial Intelligence, Amity University, Noida, India






Max Pooling
Time series Signal
0 .5 2 1 0 0 0 0 0 0 0

max

Time series Signal


.5 2 0 0 0

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Convolution with 2 kernels
The time-series signal is convolved with two different (inverted) kernels, using padding = same and ReLU activation. Each kernel produces its own feature map, so the output is two feature maps.
Time series Signal -> Feature Maps

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Flatten Layer
The values of the two feature maps are concatenated one after the other into a single one-dimensional vector (the flatten layer), which can then be fed to a fully connected layer.
Time series Signal -> Feature Maps -> Flatten Layer

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer

The flattened vector is fed to a fully connected (FC) layer with three output units; in this example the FC outputs are 7, -2 and 0.4.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Activation Function

A sigmoid activation is applied to the fully connected layer outputs: sigmoid(7) ≈ 0.99, sigmoid(-2) ≈ 0.1, sigmoid(0.4) ≈ 0.59.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Classification

The three activation outputs (0.99, 0.1, 0.59) are the probability values for the 3 classes; the largest value (0.99) assigns the input time-series signal to Class 1.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Time Series Signal Classification using CNN

Input → CNN architecture for a 1-D signal → Output

Amity Centre for Artificial Intelligence, Amity University, Noida, India
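The pipeline above (convolution, ReLU, max pooling, flatten, fully connected layer, class probabilities) can be written directly in Keras. The sketch below is illustrative rather than the exact network from these slides: the signal length, number of kernels and a softmax output (instead of the per-unit sigmoid shown above) are assumptions chosen to mirror the walkthrough.

import tensorflow as tf

# A minimal 1-D CNN for 3-class time-series classification (illustrative sizes)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(11, 1)),                  # 11 time steps, 1 channel
    tf.keras.layers.Conv1D(2, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),        # probabilities for 3 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()

Softmax is the more common choice for mutually exclusive classes; replacing it with a sigmoid per unit reproduces the slide's numbers more literally.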


Deep learning
• Course Code:
Popular CNN architecture.
Convolutional Neural
Networks and Transfer
Learning
Popular CNN Architectures

Amity Centre for Artificial Intelligence, Amity University, Noida, India


ImageNet Dataset
• ImageNet is a dataset of
• over 15 million labelled high-resolution
images
• from ~22,000 categories
• ImageNet Large Scale Visual
Recognition Challenge (ILSVRC):
• Held between 2010 and 2017
• Uses ~1000 categories, each with ~1000
images

Amity Centre for Artificial Intelligence, Amity University, Noida, India


ImageNet Large Scale Visual Recognition Challenge (ILSVRC):

Algorithms that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) from 2010 to 2017. The top-5 error refers to the probability that all top-5 classifications proposed by the algorithm for an image are wrong. The algorithms shown in blue are convolutional neural networks. Although VGGNet took second place in 2014, it is widely used in studies because of its concise structure.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
AlexNet

AlexNet is a pioneering convolutional neural network (CNN) used primarily for image recognition and classification tasks. It won the ImageNet Large Scale Visual Recognition Challenge in 2012, marking a breakthrough in deep learning. AlexNet's architecture, with its innovative use of convolutional layers and rectified linear units (ReLU), laid the foundation for modern deep learning models, advancing computer vision and pattern recognition applications.

AlexNet won the ImageNet large-scale visual recognition challenge in 2012. The model was proposed in the 2012 research paper "ImageNet Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky and his colleagues.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1. AlexNet has eight layers with learnable parameters.
2. The model consists of five convolutional layers, some followed by max pooling, and three fully connected layers; ReLU activation is used in each of these layers except the output layer.
3. The authors found that using the ReLU activation function accelerated training by almost six times.
4. They also used dropout layers, which prevented the model from overfitting. The model was trained on the ImageNet dataset.
5. The total number of parameters in this architecture is 62.3 million.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AlexNet
• Input: 227x227x3
• Conv filter sizes: 11x11, 5x5, and three 3x3
• 3 MaxPool layers
• ReLU for hidden units
• Softmax for output
• Flattened feature volume before the fully connected layers: 6x6x256 = 9216
• Dropout (p = 0.5) applied after the first two fully connected layers

P.S.: Output = ((Input - filter size + 2p) / stride) + 1
Amity Centre for Artificial Intelligence, Amity University, Noida, India
AlexNet

P.S.: Output = ((Input - filter size) / stride) + 1 (no padding)


Amity Centre for Artificial Intelligence, Amity University, Noida, India
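As a quick check of the output-size formula above, here is a small, hypothetical helper (not part of the slides) that evaluates it; the example values follow AlexNet's first convolution and the pooling that follows it.

# Output = ((Input - filter_size + 2 * padding) / stride) + 1
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    return (input_size - filter_size + 2 * padding) // stride + 1

# AlexNet's first conv layer: 227x227 input, 11x11 filter, stride 4, no padding
print(conv_output_size(227, 11, padding=0, stride=4))   # 55
# Max pooling that follows: 55x55 input, 3x3 window, stride 2
print(conv_output_size(55, 3, padding=0, stride=2))     # 27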
Summary Highlights in AlexNet
• ReLU activation (avoid vanishing gradient),
• Data Augmentation (avoid overfitting),
• Dropout regularization (avoid co-adaptation)
• Introduced Local Response Normalization (LRN)
• LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local neighbourhood (inter-channel or intra-channel)
• It performs lateral inhibition: the capacity of a neuron to reduce the activity of its neighbours

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Inter-Channel LRN
The neighbourhood is defined across the channels. For each (x,y) position, the normalization is carried out in the depth dimension and is given by

b(x,y)^i = a(x,y)^i / ( k + α · Σ_j ( a(x,y)^j )² )^β ,  with the sum taken over the n channels neighbouring channel i,

where i indicates the output of filter i, a(x,y) and b(x,y) are the pixel values at position (x,y) before and after normalization respectively, and N is the total number of channels. The constants (k, α, β, n) are hyper-parameters: k is used to avoid singularities (division by zero), α is used as a normalization constant, and β is a contrasting constant. The constant n defines the neighbourhood length, i.e. how many consecutive channels are considered while carrying out the normalization. The case (k, α, β, n) = (0, 1, 1, N) is the standard normalization.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Intra-Channel LRN
In intra-channel LRN, the neighbourhood is defined within the same channel only,

where (W, H) are the width and height of the feature map. The only difference between inter- and intra-channel LRN is the neighbourhood used for normalization: in intra-channel LRN, a 2-D neighbourhood is defined around the pixel under consideration, as opposed to the 1-D neighbourhood across channels used in inter-channel LRN.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
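TensorFlow provides a built-in op for inter-channel LRN. The short sketch below is illustrative; the hyper-parameter values are set to those reported for AlexNet (k = 2, n = 5, α = 1e-4, β = 0.75) rather than anything prescribed by these slides.

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 16])          # (batch, height, width, channels)
y = tf.nn.local_response_normalization(
    x,
    depth_radius=2,                          # ~ n/2 neighbouring channels on each side
    bias=2.0,                                # the constant k
    alpha=1e-4,
    beta=0.75,
)
print(y.shape)                               # (1, 8, 8, 16) -- shape is unchanged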


VGGNet
VGG stands for Visual Geometry Group. The VGG architecture is the basis of ground-breaking object recognition models.

Why? VGGNet was born out of the need to reduce the number of parameters in the Conv layers and improve training time.

What? There are multiple variants of VGGNet (VGG16, VGG19, etc.) which differ only in the total number of layers in the network.

• The VGG network is a convolutional neural network model proposed by K. Simonyan and A. Zisserman in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".
• This architecture achieved a top-5 test accuracy of 92.7% on ImageNet, which has over 14 million images belonging to 1000 classes.

It is one of the famous architectures in the deep learning field. Replacing the large 11x11 and 5x5 kernels used in the first and second layers of AlexNet with multiple 3×3 kernels stacked one after another showed an improvement over the AlexNet architecture. It was trained for weeks using NVIDIA Titan Black GPUs.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
VGGNet
• Developed by Visual Geometry Group in 2014
• VGG16 was 2nd in ILSVRC challenge 2014 (top-5 classification error of 7.32%)
• Characterized by Simplicity and Depth
• All Conv layers with 3x3 filters and stride 1, SAME padding
• All max pooling layers use 2x2 filters, stride 2
• VGG16: 16-layer CNN (16 layers with trainable parameters, over 134 million
parameters); VGG19: 19-layer CNN (more than 144 million parameters)
• VGG19: The concept of the VGG19 model (also VGGNet-19) is the same as the
VGG16 except that it supports 19 layers. The “16” and “19” stand for the number
of weight layers in the model (convolutional layers). This means that VGG19 has
three more convolutional layers than VGG16.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


VGGNet

Conv = 3x3 filters, s = 1, same padding; Max pool = 2x2, s = 2 (5 max pooling layers)
ReLU activation in all hidden units; Softmax activation in the output units

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Amity Centre for Artificial Intelligence, Amity University, Noida, India


Architecture of VGG:

Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition,
the creators of the model cropped out the center 224×224 patch in each image to keep the
input size of the image consistent.
Convolutional Layers:
VGG’s convolutional layers leverage a minimal receptive field, i.e., 3×3, the smallest possible
size that still captures up/down and left/right. Moreover, there are also 1×1 convolution filters
acting as a linear transformation of the input. This is followed by a ReLU unit, which is a huge
innovation from AlexNet that reduces training time.
The convolution stride is fixed at 1 pixel to keep the spatial resolution preserved after
convolution (stride is the number of pixel shifts over the input matrix).
Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually
leverage Local Response Normalization (LRN) as it increases memory consumption and
training time. Moreover, it makes no improvements to overall accuracy.
Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the three layers, the first two have 4096 channels each, and the third has 1000 channels, one for each class.
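For reference, the pre-trained VGG16 described above can be loaded directly from Keras Applications. This is a minimal sketch (the ImageNet weights are downloaded on first use and require internet access).

import tensorflow as tf

# 13 convolutional layers + 3 fully connected layers, over 134 million parameters
vgg16 = tf.keras.applications.VGG16(weights="imagenet",
                                    include_top=True,
                                    input_shape=(224, 224, 3))
vgg16.summary()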

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Complexity and challenges
• The number of filters doubles after every stack of convolution layers; this is a major principle used in designing the architecture of the VGG16 network.
• One of the crucial downsides of the VGG16 network is that it is a huge network, which means that it takes more time to train its parameters.
• Because of its depth and number of fully connected layers, the VGG16 model is more than 533 MB. This makes implementing a VGG network a time-consuming task.

Performance of VGG Models

• VGG16 significantly surpasses the previous generation of models from the ILSVRC-2012 and ILSVRC-2013 competitions. Moreover, the VGG16 result is competitive with the classification-task winner (GoogLeNet, 6.7% error) and considerably outperforms the ILSVRC-2013 winning submission Clarifai, which obtained 11.2% with external training data and around 11.7% without it.
• In terms of single-net performance, the VGGNet-16 model achieves the best result with about 7.0% test error, thereby surpassing a single GoogLeNet by around 0.9%.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stacking of Conv
• Multiple stacked Conv layers lead to a wide receptive field
• In VGG, varying filter sizes are implemented by stacking Conv layers with fixed (3x3) filter sizes
• Two stacked 3x3 convolutions have the same effective receptive field as a single 5x5 convolution

Amity Centre for Artificial Intelligence, Amity University, Noida, India


GoogLeNet (Inception v1)
• Developed by Google in 2014 Inception Module

• 1st position in ILSVRC challenge 2014


(top-5 classification error of 6.66%)
• 22-layers with trainable parameters
(27 layers including Max Pool layers)
• Parameters: 5 million (V1), 23 million (V3)
• Contains Inception Modules

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
"Going deeper with convolutions." In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1-9. 2015.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
GoogLeNet (Inception v1)

Inception Module/ Cell Inception Module


• Extract features at different scales from the
input (1x1, 3x3, 5x5)
• Max pooling with "same" padding to
preserve dimensions
• 1x1 Conv to decrease the number of feature
maps (feature-map pooling layer)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
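A minimal sketch of such an inception module in the Keras functional API is shown below; the branch filter counts are illustrative assumptions, not the exact GoogLeNet configuration.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, f_pool=32):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)         # 1x1 branch
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)        # 3x3 branch
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)        # 5x5 branch
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)               # pooling branch
    b4 = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(b4)    # 1x1 projection
    return layers.Concatenate()([b1, b2, b3, b4])                           # stack feature maps

inp = layers.Input(shape=(28, 28, 192))
out = inception_module(inp)
print(tf.keras.Model(inp, out).output_shape)   # (None, 28, 28, 256)

The 'same' padding on every branch keeps the spatial dimensions equal so the branch outputs can be concatenated along the channel axis.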


Amity Centre for Artificial Intelligence, Amity University, Noida, India
GoogLeNet (Inception v1)

GoogLeNet contains 9 inception modules, a final classifier, and two auxiliary classifiers.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Auxiliary Classifiers
• Intermediate softmax branches placed in the middle of the network
• Only used during training
• Purpose: combating the vanishing gradient problem, regularization
• Their loss is added to the total loss with weight 0.3

Auxiliary classifier structure:
• 5×5 Average Pooling (stride 3)
• 1×1 Conv (128 filters)
• 1024-unit FC layer
• 1000-unit FC layer
• Softmax
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
What's Novel in GoogLeNet?
• Inception module
• 1x1 convolutions
• Global average pooling
• Auxiliary classifiers
• Increased network depth (22 layers)

Their architecture consists of a 22-layer deep CNN, yet it reduced the number of parameters from 60 million (AlexNet) to 4 million.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Glimpse of Backpropagation Algorithm

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Glimpse of Backpropagation Algorithm
• After propagating the input features forward
to the output layer through the various
hidden layers consisting of different/same
activation functions, we come up with a
predicted probability of a sample belonging
to the positive class (generally, for
classification tasks).
• Now, the backpropagation algorithm
propagates backward from the output layer
to the input layer calculating the error
gradients on the way.
• Once the computation for gradients of the
cost function w.r.t each parameter (weights
and biases) in the neural network is done,
the algorithm takes a gradient descent step
towards the minimum to update the value
of each parameter in the network using
these gradients.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Vanishing Gradient
• As the backpropagation
algorithm advances
downwards(or backward)
from the output layer
towards the input layer,
the gradients often get
smaller and smaller and
approach zero which
eventually leaves the
weights of the initial or
lower layers nearly
unchanged.
• As a result, the gradient
descent never converges
to the optimum. This is
known as the vanishing
gradients problem.

"The gradients will be very small for the earlier layers, which means there is no major difference between the new weights and the old weights."

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing Gradient
• The deterioration in the gradient value is proportional to the depth of the network.

• The deeper the network, the higher the chance of obtaining a vanishingly small gradient value towards the end of backpropagation.

• The vanishing gradient problem mainly occurs with the sigmoid and tanh activation functions.
"The gradients will be very small for the earlier layers, which means there is no major difference between the new weights and the old weights."

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing Gradient- Example
• Activation functions such as the sigmoid function have a very prominent difference between the variance of their inputs and outputs.
• They shrink and transform a large input space into a smaller output space, which lies between [0, 1].
• Large inputs, regardless of whether they are negative or positive, are mapped to values very close to 0 or 1, where the function is nearly flat.
• Consequently, during backpropagation there is almost no gradient to propagate backward through the neural network.
• The little gradient that does exist keeps diluting as the algorithm proceeds through the upper layers, leaving almost nothing for the lower layers.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
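The effect is easy to see numerically. The short, illustrative snippet below (not from the slides) evaluates the sigmoid derivative sigma'(x) = sigma(x)(1 - sigma(x)) and shows how quickly it collapses for large inputs.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x={x:5.1f}   sigmoid={s:.6f}   gradient={s * (1 - s):.2e}")
# The gradient is 0.25 at x=0 but only ~4.5e-05 at x=10:
# almost nothing is left to propagate back through the earlier layers.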


Exploding Gradient
• On the contrary, in some
cases, the gradients keep
on getting larger and
larger as the
backpropagation
algorithm progresses.

• This, in turn, causes very


large weight updates and
causes the gradient
descent to diverge.

• This is known as
the exploding
gradients problem.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exploding Gradient
• Similarly, in some cases suppose
the initial weights assigned to the
network generate some large loss.

• Now the gradients can accumulate


during an update and result in very
large gradients which eventually
results in large updates to the
network weights and leads to an
unstable network.

• The parameters can sometimes


become so large that they overflow
and result in NaN values.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing Gradient vs. Exploding Gradient

Amity Centre for Artificial Intelligence, Amity University, Noida, India


How to identify a vanishing or exploding gradients problem?

Vanishing
• Large changes are observed in the parameters of later layers, whereas the parameters of earlier layers change only slightly or stay unchanged
• In some cases, the weights of earlier layers can become 0 as the training goes on
• The model learns slowly, and often training stops after a few iterations
• Model performance is poor

Exploding
• Contrary to the vanishing scenario, exploding gradients show up as unstable, large parameter changes from batch/iteration to batch/iteration
• Model weights can become NaN very quickly
• Model loss also goes to NaN

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Methods to solve the problem of Vanishing/Exploding gradients
• Using a smaller number of layers: A straightforward approach is to use fewer layers in our network, so that the gradient is not multiplied too many times. This may stop the gradient from vanishing or exploding, but it costs us the ability of the network to learn complex features.
• Using the correct activation functions: Saturating functions such as sigmoid saturate for larger inputs and cause the vanishing gradient problem. We can use non-saturating activation functions such as ReLU and its alternatives such as leaky ReLU.
• Using batch normalization: Batch normalization ensures that vanishing/exploding gradients do not appear in between the layers (see the sketch after this list).
• Gradient clipping: A popular method used to solve the exploding gradient problem. It limits the size of the gradients so that they never exceed some specified value (see the sketch after this list).
• Careful weight initialization: We can partially solve both of these problems by carefully choosing the initial model parameters.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
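A hedged Keras sketch of two of these remedies follows: a BatchNormalization layer between hidden layers and gradient clipping configured on the optimizer. The layer sizes and the clipping threshold are illustrative assumptions.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),        # normalize activations between layers
    tf.keras.layers.Activation("relu"),          # non-saturating activation
    tf.keras.layers.Dense(10, activation="softmax"),
])

# clipnorm rescales any gradient whose L2 norm exceeds 1.0 (combats exploding gradients)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")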


Methods to solve the problem of Vanishing/Exploding gradients

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Methods to solve the problem of Vanishing/Exploding gradients
• Skip Connections
 Skip connections (used in ResNet) prevent the vanishing gradient problem during deep neural network training.
 These connections enable the direct flow of information from earlier layers to later layers, aiding in preserving gradients and promoting better convergence.
 The loss surface of a neural network with skip connections is smoother, leading to faster convergence than the network without any skip connections.

The loss surfaces of ResNet-56 with and without skip connections

Amity Centre for Artificial Intelligence, Amity University, Noida, India


ResNet

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing/Exploding Gradient:
This is one of the most common problems plaguing the training of larger/deep neural networks, and is a result of oversight in terms of the numerical stability of the network's parameters. During back-propagation, as we keep moving from the deep to the shallow layers, the chain rule of differentiation makes us multiply the gradients. Often these gradients are small, of the order of 10^-5 or less, and as these small numbers keep getting multiplied with each other they become infinitesimally small, making almost negligible changes to the weights. (Vanishing Gradient)

On the other end of the spectrum, there are cases when the gradient reaches orders of up to 10^4 and more. As these large gradients multiply with each other, the values tend to move towards infinity. Allowing such a large range of values in the numerical domain of the weights makes convergence difficult to achieve. (Exploding Gradient)

ResNet, due to its architecture, does not allow these problems to occur at all. The skip connections act as gradient super-highways, allowing the gradient to flow without being altered by a large magnitude.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
The ResNet architecture is considered to be among the most popular
Convolutional Neural Network architectures around. Introduced by
Microsoft Research in 2015, Residual Networks (ResNet in short) broke
several records when it was first introduced.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Residual Block
Residual Block
• Skip connection skips training from a few
layers and connects directly to the output
• Instead of learning the underlying
mapping H(x) from stacked layers, let
network learn the residual F(x) = H(x)-x
• Hence, after adding identity, F(x)+x =
H(x)
• Speeds learning by reducing the impact of vanishing gradients, and avoids degradation
• Enables the development of deeper networks

In mathematical terms, it would mean y = x + F(x), where y is the final output of the layer.
In terms of architecture, if any layer ends up damaging the performance of the model in a plain network, it gets skipped due to the presence of the skip connections.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
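A minimal sketch of such a residual block in the Keras functional API is given below; the filter count, kernel size and use of batch normalization are illustrative assumptions rather than the exact ResNet block.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                          # identity (skip) path
    y = layers.Conv2D(filters, 3, padding="same")(x)      # F(x): two stacked conv layers
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                       # H(x) = F(x) + x
    return layers.Activation("relu")(y)

inp = layers.Input(shape=(32, 32, 64))
out = residual_block(inp)
print(tf.keras.Model(inp, out).output_shape)              # (None, 32, 32, 64)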


ResNet

• Proposed by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

• Addresses the vanishing gradient problem of deep NNs, which grows worse with increasing depth

• 1st position in the ILSVRC challenge 2015 (top-5 classification error of 3.57%)

• ResNet-50 (50 conv layers) has a parameter count of approximately 25.6 million, which makes it a moderately large network compared to earlier architectures.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016, https://arxiv.org/abs/1512.03385.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
What’s Novel in ResNet?

Residual Connection
High Accuracy
Bottleneck layers

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Different versions of the ResNet architecture use a varying number of blocks at different levels.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
What’s Novel in ResNet?

Despite its depth, ResNet-50 uses only about 25.6 million parameters, far fewer than AlexNet's 60 million.
Popularized skip connections (they weren't the first to use skip connections).
Designed even deeper CNNs (up to 152 layers) without compromising the model's generalization power.
Among the first to use batch normalization.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


CNN: Utility of Layers

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture
DenseNet, or Densely Connected Convolutional Network, is a type of convolutional neural network that uses dense connections between layers. DenseNets are feed-forward networks that connect each layer to every other layer; they are used to increase the depth of a convolutional neural network.
DenseNets have several advantages, including:
• Reduced gradient vanishing: DenseNets alleviate the vanishing gradient problem, which makes deep networks difficult to optimize.
• Feature propagation: DenseNets strengthen feature propagation.
• Feature reuse: DenseNets encourage feature reuse.
• Number of parameters: DenseNets substantially reduce the number of parameters.
• Compact input features: DenseNet provides compact and differentiated input features via shortcut connections of different lengths.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture
ResNet performs an element-wise addition to pass the output to the next layer or block. DenseNet connects all layers
directly to each other. It does this through concatenation.
With concatenation, each layer receives collective knowledge from the preceding layers.

Because of these dense connections, the model requires fewer layers, as there is no need to learn redundant feature
maps, allowing the collective knowledge (features learned collectively by the network) to be reused.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
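Below is a minimal sketch of a DenseNet-style dense block in Keras, contrasting concatenation with ResNet's element-wise addition; the growth rate and number of layers are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)  # growth_rate new feature maps
        x = layers.Concatenate()([x, y])                      # concatenate, don't add
    return x

inp = layers.Input(shape=(32, 32, 16))
out = dense_block(inp)
print(tf.keras.Model(inp, out).output_shape)   # (None, 32, 32, 64) = 16 + 4 * 12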


DenseNet Architecture
A DenseNet is a type of convolutional neural network that utilises dense connections between layers,
through Dense Blocks, where we connect all layers (with matching feature-map sizes) directly with each
other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers
and passes on its own feature-maps to all subsequent layers.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Resnet vs. DenseNet Architecture
When comparing DenseNet with ResNet, several key differences stand out:

• Skip Connections: ResNet uses skip connections to implement identity mappings, allowing gradients to
flow through the network without attenuation. DenseNet, on the other hand, uses dense connections,
concatenating feature maps from all preceding layers.
• Memory Usage: DenseNets generally require more memory than ResNets due to the concatenation of
feature maps from all layers. This can be a limiting factor in certain applications.
• Parameter Efficiency: DenseNet is often more parameter-efficient than ResNet. It reuses features
throughout the network, reducing the need to learn redundant feature maps.
• Training Dynamics: DenseNets might have a smoother training process due to the continuous feature
propagation throughout the network. However, this can also lead to increased training time and
computational costs.
• Performance: Both architectures have shown exceptional performance in various tasks. ResNet is often
preferred for very deep networks due to its simplicity and lower computational requirements. DenseNet
shines in scenarios where feature reuse is critical and can afford the additional computational cost.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Popular CNN Architectures: LeNet-5
• Proposed by LeCun et al in 1998
• Applied by several banks to recognise hand-written characters on cheques
digitized to 32x32 pixel greyscale input images
• 5 layers with learnable parameters, 7 layers in total
• 2 set of Conv-Subsampling
• 1 Conv, 1 FC
• 1 Output (10 units)

LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document
recognition." Proc. IEEE 86, no. 11 (1998): 2278-2324.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
LeNet-5
• Input: 32x32x1 greyscale images

• C1: (32 - 5)/1 + 1 = 27 + 1 = 28, giving 28x28 feature maps

• S2: (28 - 2)/2 + 1 = 13 + 1 = 14, giving 14x14 feature maps

• C3: (14 - 5)/1 + 1 = 9 + 1 = 10, giving 10x10 feature maps

• S4: (10 - 2)/2 + 1 = 4 + 1 = 5, giving 5x5 feature maps

• C5: (5 - 5)/1 + 1 = 1, giving 1x1 feature maps

• LeNet-5's architecture has become the standard 'template': stacking convolutions with an activation function and pooling layers, and ending the network with one or more fully-connected layers.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
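A hedged Keras sketch of LeNet-5 follows, useful for checking the layer sizes computed above; average pooling and tanh activations are used here as approximations of the original subsampling and squashing layers, so details differ slightly from the 1998 paper.

import tensorflow as tf
from tensorflow.keras import layers

lenet5 = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, 5, activation="tanh"),      # C1: 28x28x6
    layers.AveragePooling2D(2, strides=2),       # S2: 14x14x6
    layers.Conv2D(16, 5, activation="tanh"),     # C3: 10x10x16
    layers.AveragePooling2D(2, strides=2),       # S4: 5x5x16
    layers.Conv2D(120, 5, activation="tanh"),    # C5: 1x1x120
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),         # F6
    layers.Dense(10, activation="softmax"),      # output: 10 classes
])
lenet5.summary()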
Popular CNN Architectures
• ImageNet Large Scale
Visual Recognition
Challenge (ILSVRC) Winners
• AlexNet (1st, 2012)
• VGGNet (2nd, 2014)
• GoogLeNet (1st, 2014)
• ResNet (1st, 2015)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer Learning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer Learning
.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer learning is a machine learning technique that involves using knowledge gained from one task to improve the performance on a related task.
Or:
Instead of training a model from scratch for a new task, transfer learning allows us to reuse a model pre-trained on a related task and fine-tune it for the new task.

Gif source: https://deepnote.com/@jhon-smith-flores/Transfer-Learning-864f7d51-84f9-4d43-baa0-6194de7943de
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
How TL works in case of Deep Learning Models?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer Learning
• Improvement of learning in a new task through the transfer of
knowledge from a related task that has already been learned.
• Weight initialization for CNN

• Two major strategies


• ConvNet as fixed feature extractor
• Fine-tuning the ConvNet

Amity Centre for Artificial Intelligence, Amity University, Noida, India
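A minimal sketch of the first strategy, using the ConvNet as a fixed feature extractor, is shown below. MobileNetV2, the 160x160 input size and the two-class head are illustrative assumptions, not choices made in these slides.

import tensorflow as tf

# Pre-trained backbone without its ImageNet classification head
base_model = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                               include_top=False,
                                               weights="imagenet")
base_model.trainable = False                       # freeze all pre-trained layers

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # new task-specific output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The second strategy, fine-tuning, starts from this frozen setup and then unfreezes part of the base model, as shown in the fine-tuning code later in this section.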


When to fine-tune your model?

• New dataset is large and similar to the original dataset: fine-tune through some of the last layers
• New dataset is large and very different from the original dataset: fine-tune through some of, or the entire, network

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Steps in Transfer Learning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1. Obtain a pre-trained model
• VGG-16
• VGG-19
• Inception V3
• Xception
• ResNet-50

Amity Centre for Artificial Intelligence, Amity University, Noida, India


2. Create a base model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3. Freeze layers
• Freezing the starting layers from the pre-trained model is essential to
avoid the additional work of making the model learn the basic
features.
• If we do not freeze the initial layers, we will lose all the learning that
has already taken place. This will be no different from training the
model from scratch and will be a loss of time, resources, etc.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
4. Add new trainable layers

• The only knowledge we are reusing from the base model is the feature extraction layers. We need to
add additional layers on top of them to predict the specialized tasks of the model. These are
generally the final output layers.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
5. Train the new layers
• The pre-trained model’s final output will most likely differ from
the output we want for our model. For example, pre-trained
models trained on the ImageNet dataset will output 1000
classes.
• However, we need our model to work for two classes. In this
case, we have to train the model with a new output layer in
place.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


6. Fine-tune your model
• Fine-tuning can be used to extract more specific features for the new task without training the model from scratch.
• Fine-tuning involves unfreezing some part of the base model and training the entire model again on the whole dataset at a very low learning rate. The low learning rate improves the performance of the model on the new dataset while preventing overfitting.
• In this step, the weights of the top layers of the pre-trained model are trained, which forces the weights to be tuned from generic feature maps to features associated specifically with the dataset.
• The first few layers learn very simple and generic features that generalize to almost all types of images. As you go higher up, the features are increasingly specific to the dataset on which the model was trained.
• The goal of fine-tuning is to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Freeze vs. Fine-Tune
• For this, unfreeze the base_model and set the bottom layers to be un-trainable. Then recompile the model and resume training.

base_model.trainable = True

# Let's take a look to see how many layers are in the base model
print("Number of layers in the base model: ", len(base_model.layers))

# Fine-tune from this layer onwards
fine_tune_at = 100

# Freeze all the layers before the `fine_tune_at` layer
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
• Course Code:
• Unit 3
Convolutional Neural Networks and
Transfer Learning
• Lecture 3
• Parameter sharing, receptive
field, 1D, 2D, 3D convolution,
Convolutional Neural Network
Understanding Receptive Field
Field of view

• The human visual system consists of millions of neurons, where each one captures different information.

• A neuron's receptive field is defined as the patch of the total field of view that the neuron responds to, i.e. what information a single neuron has access to.

Image source: https://www.brainhq.com/brain-resources/brain-connection
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Understanding Receptive field

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Receptive Field in Deep Learning

• Defined as the size of the region in the input that produces a feature; basically, it is a measure of the association of an output feature (of any layer) with a region (patch) of the input.
• The idea of receptive fields applies to local operations (i.e. convolution, pooling).
• A convolutional unit only depends on a local region (patch) of the input.
• That is why the receptive field is never discussed for fully connected layers, since each unit has access to the entire input region.

Effective receptive field: two stacked 3x3 convolutions cover the same 5x5 input region as a single 5x5 convolution.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Receptive Field in Deep Learning

Illustrating the total receptive field and total stride attributes for the L’th layer, which could be seen as the projected
receptive field and stride with respect to the input layer. Together, they capture the overlapping degree of a network.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we care for Receptive Field?

The green and the orange one. Which one would you like to
have in your architecture?
Image Source: https://developer.nvidia.com/blog/image-segmentation-using-digits-5/
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Why do we care for Receptive Field?

Therefore, our goal is to design a convolutional model so that we


ensure that its RF covers the entire relevant input image region.
Image Source: https://developer.nvidia.com/blog/image-segmentation-using-digits-5/
Amity Centre for Artificial Intelligence, Amity University, Noida, India
How to increase receptive field in a convolutional network?
• Add more convolutional layers (make the network deeper)

• Add pooling layers or higher stride convolutions (sub-sampling)


• Use dilated convolutions: a technique that expands the kernel by inserting holes (gaps) between its consecutive elements.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”

Conventional Convolution vs. Dilated convolutions

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”

Convolving a 3 × 3 kernel over a


7 × 7 input with a dilation factor
of 2 (i.e., i = 7, k = 3, d = 2, s = 1
and p = 0).

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”

• Dilated convolutions “inflate”


the kernel by inserting spaces
between the kernel elements.
• The dilation “rate” is
controlled by an additional
hyperparameter d.
• Implementations may vary,
but there are usually d−1
spaces inserted between
kernel elements such that d =
1 corresponds to a regular
convolution

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”
To understand the relationship tying the dilation rate d and the output size o, it is useful to think of the impact of d on the effective kernel size. A kernel of size k dilated by a factor d has an effective size

k_eff = k + (k - 1)(d - 1)

For any i, k, p and s, and for a dilation rate d, the output size is

o = floor( (i + 2p - k - (k - 1)(d - 1)) / s ) + 1

Amity Centre for Artificial Intelligence, Amity University, Noida, India
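As a quick, illustrative check of these formulas in Keras: a 3x3 kernel with dilation rate d = 2 has effective size 3 + (3 - 1)(2 - 1) = 5, so on a 7x7 input with no padding and stride 1 the output is (7 - 5)/1 + 1 = 3.

import tensorflow as tf

x = tf.random.normal([1, 7, 7, 1])                       # i = 7
y = tf.keras.layers.Conv2D(filters=1, kernel_size=3,     # k = 3
                           dilation_rate=2,              # d = 2
                           strides=1, padding="valid")(x)
print(y.shape)                                           # (1, 3, 3, 1) -> o = 3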


Parameter Sharing

• Parameter sharing refers to using the same parameter for more than one function in a model
• The kernel is reused (by sliding) when calculating the layer output
• Fewer weights to store and train

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Equivariant Representation

• Parameter sharing causes the layer to have a property called equivariance to translation
• Convolution creates a 2-D map of where certain features appear in the input
• If we move the object in the input, its representation will move by the same amount in the output
• E.g. the same kernel detects an edge wherever the edge occurs in the image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1D, 2D Convolution

• 2D Convolution
• 2 directions (x, y) to calculate the convolution
• input = (W x H x c), d filters of size (k x k x c), output = (W1 x H1 x d)
• E.g. image data (grayscale or colour)

• 1D Convolution
• 1 direction (time) to calculate the convolution
• input = (time-steps x c), d filters of size (k x c), output = (time-steps1 x d)
• E.g. time-series data, text analysis


Amity Centre for Artificial Intelligence, Amity University, Noida, India
2D, 3D Convolution

• 3D Convolution
• 3 directions (x, y, z) to calculate the convolution
• input = (W x H x L x C), m filters of size (k x k x d), output = (W1 x H1 x L1 x m)
• E.g. MRI data, videos

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3D Convolution: Example

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3D Convolution: Example

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3D Convolution: Example

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1D CNN: input shape = 2D (batch size = None, width = time axis = 7, feature maps/channels = 1)

2D CNN: input shape = 3D (height = 5, width = 7, feature maps/channels = 1)

3D CNN: input shape = 4D (height = 6, width = 6, feature maps/channels = depth = 1)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
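The short, illustrative Keras sketch below shows the per-sample input shapes expected by Conv1D, Conv2D and Conv3D layers (the batch dimension is left as None); the depth of 6 used for the 3-D example is an assumption made for illustration.

import tensorflow as tf
from tensorflow.keras import layers

conv1d = tf.keras.Sequential([layers.Input(shape=(7, 1)),        # (time steps, channels)
                              layers.Conv1D(4, 3, padding="same")])
conv2d = tf.keras.Sequential([layers.Input(shape=(5, 7, 1)),      # (height, width, channels)
                              layers.Conv2D(4, 3, padding="same")])
conv3d = tf.keras.Sequential([layers.Input(shape=(6, 6, 6, 1)),   # (depth, height, width, channels)
                              layers.Conv3D(4, 3, padding="same")])
print(conv1d.output_shape)   # (None, 7, 4)
print(conv2d.output_shape)   # (None, 5, 7, 4)
print(conv3d.output_shape)   # (None, 6, 6, 6, 4)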




• If the input has one channel, such as a grayscale image, then a 3×3 filter will be applied in 3x3x1 blocks.
• If the input image has three channels for red, green, and blue, then a 3×3 filter will be applied in 3x3x3 blocks.
• If the input is a block of feature maps from another convolutional or pooling layer and has a depth of 64, then the 3×3 filter will be applied in 3x3x64 blocks to create the single values that make up the single output feature map.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution with Single Channel and Multiple Filters

• For an input with 1 channel (e.g. a grayscale image), a 3×3 filter is applied in 3x3x1 blocks
• Filters are applied as (k x k x 1); the depth of the output feature map equals the number of filters

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution over Volume (Multiple Channels)

• RGB images has 3


channels:
Red, Green, Blue
• One kernel for every
input channel to the
layer (each kernel is
unique)
• Each filter = a collection
of kernels

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution with Multiple Channels and Multiple Filters

• For an input with 1 channel (e.g. a grayscale image), a 3×3 filter is applied in 3x3x1 blocks
• For an input with 3 channels (e.g. red, green and blue for a colour image), a 3×3 filter is applied in 3x3x3 blocks
• If the input is a block of feature maps from another convolutional layer with a depth of, say, 64, then the 3×3 filter is applied in 3x3x64 blocks to create a single output feature map
• Filters are applied as (k x k x 3); the depth of the output feature map equals the number of filters

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride

Stride is the number of pixel shifts over the input matrix.

left image: stride =0, middle image: stride = 1, right image: stride =2

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Padding
In padding, we add layers of zeros around the input image matrix; this layer of zeros is known as padding. (Image shown with padding = 2.)
• Valid: when padding = 'valid', no padding is applied to the image, i.e. no zeros are added to the image.
• Same: when padding = 'same', padding is applied to the image, i.e. zeros are added around the image.

Padding
• It refers to the number of pixels added to an image when it is being processed by a kernel or filter.
• Half padding means padding of half the filter size, and full padding means padding equal to the size of the filter/kernel.
• Padding is done to reduce the loss of data along the sides/boundary of the image.

Padding affects the output image size while filtering in the Conv layer (assumption: stride = 1).
Amity Centre for Artificial Intelligence, Amity University, Noida, India
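A quick, illustrative check of the two padding modes in Keras: with a 3x3 filter and stride 1, 'valid' shrinks a 6x6 input to 4x4, while 'same' pads with zeros and keeps it 6x6.

import tensorflow as tf

x = tf.random.normal([1, 6, 6, 1])
valid = tf.keras.layers.Conv2D(1, 3, padding="valid")(x)
same = tf.keras.layers.Conv2D(1, 3, padding="same")(x)
print(valid.shape)   # (1, 4, 4, 1) -> output shrinks, no zeros added
print(same.shape)    # (1, 6, 6, 1) -> zero padding preserves the size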
Convolution with Multiple Channels and Multiple Filters

• Input feature maps (a x a x b) = (6x6x2); on applying d = 2 filters,
• the output feature map is (c x c x d) = (4x4x2)
• With 2 filters, the output feature map has depth 2

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Output Dimension

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

• Input image, I with dimensions (32x32x3)


• Convolution Layer
• A filter size 3x3
• Stride is 1
• Valid padding, and
• Depth/feature maps are 5 (D =5)
• Output dimensions = ?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

• Input image, I with dimensions (32x32x3)


• Convolution Layer
• A filter size 3x3
• Stride is 1 (s=1)
• Valid padding (p=0), and
• Depth/feature maps are 5 (D =5)
• Output dimensions = 30x30x5, since (32 - 3 + 2x0)/1 + 1 = 30
• After Pooling?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

• Input to the pooling layer: 30x30x5
• After pooling with filter size f and stride s, each spatial dimension becomes ((30 - f)/s) + 1
• E.g. pooling with filter size 2x2 and stride 2: ((30 - 2)/2) + 1 = 15
• Output dimensions = 15x15x5

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Typical CNN Model

Architecture: INPUT 28x28x1 → Conv1 (8 filters, 3x3, valid padding, stride 1) → Max-Pooling (2x2, stride 2) → Conv2 (16 filters, 3x3, valid padding, stride 1) → Max-Pooling (2x2, stride 2) → Fully-Connected Neural Network (64 units, ReLU activation) → Fully-Connected Neural Network (10 units, Softmax activation)

• Conv1 output: 26x26x8; parameters: 3x3x1x8 + 8 = 80 (a 3x3 filter for 1 channel, 8 such filters and 8 biases)
• Max-Pool output: 13x13x8
• Conv2 output: 11x11x16; parameters: 3x3x8x16 + 16 = 1168 (a 3x3 filter for 8 channels, 16 such filters and 16 biases)
• Max-Pool output: 5x5x16
• FC1 (64 units): input is the flattened 5x5x16 = 400 values, so parameters = (400 + 1) x 64 = 25,664
• FC2 (10 units): parameters = (64 + 1) x 10 = 650

A Keras sketch of this model follows.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
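This sketch reproduces the layer sequence above in Keras (activations placed as described: ReLU in the hidden layers, softmax at the output); model.summary() reports the same parameter counts worked out on the slide (80, 1168, 25,664 and 650).

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, strides=1, padding="valid", activation="relu"),   # 26x26x8,   80 params
    layers.MaxPooling2D(2, strides=2),                                    # 13x13x8
    layers.Conv2D(16, 3, strides=1, padding="valid", activation="relu"),  # 11x11x16, 1168 params
    layers.MaxPooling2D(2, strides=2),                                    # 5x5x16
    layers.Flatten(),                                                     # 400 values
    layers.Dense(64, activation="relu"),                                  # 25,664 params
    layers.Dense(10, activation="softmax"),                               # 650 params
])
model.summary()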


Parameters and Hyperparameters

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep Learning
Course Code:

Unit 4: Sequential models &


Recurrent Neural Networks (RNNs)

Lecture 1: Introduction to RNNs


and their applications

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need Sequential Modeling

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Why do we need Sequential Modeling

Given a Football Image

Can you predict where


it will go next?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need Sequential Modeling

Now can you predict?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need Sequential Modeling

Since we have knowledge about the kicking direction of the player, we can predict the ball's next direction.

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Need for Sequential Modeling

Sequence data | Input data | Output
• Speech recognition: "Wow, it is so nice!"
• Machine translation: "You are my best friend" → "Você é meu melhor amigo"
• Music generation
• Named entity recognition: "GH Hardy said, his contribution was discovery of Ramanujan." → the same sentence with the named entities (GH Hardy, Ramanujan) identified

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Need for Sequential Modeling

Sequence data | Input data | Output
• Sentiment classification: "Wow, it is so nice!"
• DNA sequence analysis
• Video activity recognition: "Fighting"

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Can we use ANN/CNN for Sequential Modeling?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Can we use ANN/CNN for Sequential Modeling?
No

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Reasons:
 Fixed input size
Example: image size 32x32
 Fixed output size
Example: probabilities of different classes (e.g. Cat, Dog, Rabbit)
 Fixed computational steps
Example: number of layers in the model

Image source: https://medium.com/techiepedia/binary-image-classifier-cnn-using-tensorflow-a3f5d6746697

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Reasons:
Words learnt or approximated at a later position may change the
approximation of a previous word.
Example :
• Blue dresses are looking good.
• Blue dress is looking good.
 Parameter sharing is not done in conventional ANNs.

Here comes the Recurrent Neural Network!


Amity Centre for Artificial Intelligence, Amity University, Noida, India


Reasons for Using Recurrent Neural Network (RNN)
 Can handle inputs and outputs of varying lengths.
 It uses directed cycles to recognize the sequential characteristics of the data.
 Shares parameters across different parts of the network.
 Tracks long-term dependencies.
 Maintains information about order.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
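In contrast to the fixed-size inputs discussed above, an RNN can accept sequences of any length. A minimal Keras sketch (the feature dimension and layer sizes below are illustrative assumptions, not values from these slides):

import tensorflow as tf

# Time dimension declared as None, so sequences of any length are accepted.
inputs = tf.keras.Input(shape=(None, 10))             # 10 features per time step (illustrative)
h = tf.keras.layers.SimpleRNN(32)(inputs)             # 32 hidden units (illustrative)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(h)
model = tf.keras.Model(inputs, outputs)

model(tf.random.normal((4, 7, 10)))                   # batch of length-7 sequences
model(tf.random.normal((4, 15, 10)))                  # same model, length-15 sequences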


When to use RNN?

“Whenever there is a sequence of data and the temporal dynamics that


connects the data is more important than the spatial content of each
individual frame.”

– Lex Fridman (MIT)

Image source: https://commons.wikimedia.org/wiki/File:Lex_Fridman_teaching_at_MIT_in_2018.png

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Neural Network: Simplified

Hidden
Input
Output

Weights
Standard feed-
forward network

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1

x1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2

x1 x2

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3

x1 x2 x3

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn

x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn In general,

n = time step

x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn In general,

n = time step
 Same function is used

x1 x2 x3 xn
 Replicate network any number of times
 Ensure parameter sharing
 Number of timesteps does not matter

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn

How to maintain the interdependency between inputs?

x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1
A Simple Approach

x1
Let’s consider one approach

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Let’s consider one approach

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Let’s consider one approach
s3

x1 x2 x3

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Let’s consider one approach
s3 s4

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Will this approach work?
s3 s4

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach
Problem
 Different function for different time-
step
x1 x1 x2 s1 = f1(x1)
s3 s4
s2 = f2(x1,x2)
s3 = f3(x1,x2,x3) ……

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach
Problem
 Different function for different time-
step
x1 x1 x2 s1 = f1(x1)
s3 s4
s2 = f2(x1,x2)
s3 = f3(x1,x2,x3) ……
 Depends on input length

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn
Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn
input

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn output
input

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn output
input past memory

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn snn

hn

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn xnn

Can be represented
more compactly

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network
A sequence of vectors is processed by applying a recurrence formula at each time step.
sn
hn
xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network
A sequence of vectors is processed by applying a recurrence formula at each time step:
hn = fW(hn−1, xn),   with output sn at each step
 Same function fW (same weights) is used at every time step
 Ensures parameter sharing
 Handles the temporal dependency between the elements of the sequence
Amity Centre for Artificial Intelligence, Amity University, Noida, India
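As a rough sketch of this recurrence (the sizes, the tanh activation and the variable names are illustrative assumptions), the same function with the same weights is applied at every time step, for any sequence length:

import numpy as np

def f_W(h_prev, x, W_hh, W_xh, b):
    # One step of the shared recurrence: hn = tanh(Whh·hn-1 + Wxh·xn + b)
    return np.tanh(W_hh @ h_prev + W_xh @ x + b)

rng = np.random.default_rng(0)
hidden, features = 4, 3                          # illustrative sizes
W_hh = rng.normal(size=(hidden, hidden))
W_xh = rng.normal(size=(hidden, features))
b = np.zeros(hidden)

h = np.zeros(hidden)                             # initial state h0
for x in rng.normal(size=(6, features)):         # any number of time steps
    h = f_W(h, x, W_hh, W_xh, b)                 # same function, same weights at every step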
Recurrent Neural Network Architectures
one to one

Vanilla NN

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one one to many

Vanilla NN Image
Captioning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one one to many many to one

Vanilla NN Image Sentiment


Captioning Classification

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one one to many many to one many to many

Vanilla NN    Image Captioning    Sentiment Classification    Named Entity Recognition

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one   → Vanilla NN
one to many  → Image Captioning
many to one  → Sentiment Classification
many to many → Named Entity Recognition
many to many → Machine Translation

Recurrent Neural Network


Amity Centre for Artificial Intelligence, Amity University, Noida, India
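In Keras, these architectures differ mainly in whether the recurrent layer returns only its final state or one output per time step. A hedged sketch, with layer sizes, feature dimension and class count chosen only for illustration:

import tensorflow as tf

# Many to one (e.g. sentiment classification): only the final hidden state is used.
many_to_one = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 100)),
    tf.keras.layers.SimpleRNN(64),                       # return_sequences=False by default
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Many to many (e.g. named entity recognition): one output for every time step.
many_to_many = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 100)),
    tf.keras.layers.SimpleRNN(64, return_sequences=True),
    tf.keras.layers.Dense(10, activation='softmax'),     # per-step class scores
])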
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep Learning
Course Code:

Unit 4: Sequential models &


Recurrent Neural Networks (RNNs)

Lecture 2: Introduction to RNNs


and their applications

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

Source: https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn

Amity Centre for Artificial Intelligence, Amity University, Noida, India




RNN: Forward Propagation

Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

Whh
h0
Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

Whh
h0 h1      h1 = f(Whh·h0 + Wxh·x1 + b)   (f: activation function, b: bias)
Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1
Wsh
Whh
h0 h1
Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1 s2
Wsh Wsh
Whh Whh
h0 h1 h2
Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1 s2 s3
Wsh Wsh Wsh
Whh Whh Whh
h0 h1 h2 h3
Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Weight matrices Wxh, Whh and Wsh remain the same throughout the forward propagation, thus ensuring parameter sharing
Amity Centre for Artificial Intelligence, Amity University, Noida, India
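A small NumPy sketch of this forward pass is given below; the dimensions are illustrative assumptions, the hidden activation is taken as tanh and the output as sn = Wsh·hn, matching the assumptions used in the BPTT slides that follow:

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, T = 3, 5, 2, 4          # illustrative sizes; T = number of time steps

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_sh = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_h  = np.zeros(n_hidden)

x = rng.normal(size=(T, n_in))                 # x1 ... x4
h = np.zeros(n_hidden)                         # h0
s = []
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)  # h1 ... h4, same Wxh and Whh at every step
    s.append(W_sh @ h)                         # s1 ... s4, same Wsh at every step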
RNN: Back Propagation Through Time (BPTT)
Actual outputs
y1 y2 y3 y4
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Back Propagation Through Time (BPTT)
Actual outputs
y1 y2 y3 y4
Loss calculation:  Ln = ℒ(yn, sn),  where ℒ = loss function
In general:  L = Σn Ln
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India
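As a tiny numeric illustration of summing the per-step losses (the values are made up, and the squared-error form follows the least-square assumption used on the next slides):

import numpy as np

y = np.array([0.5, -0.2, 0.9, 0.1])   # actual outputs y1..y4 (illustrative values)
s = np.array([0.4,  0.1, 0.7, 0.0])   # predicted outputs s1..s4 (illustrative values)

L_n = (y - s) ** 2                    # loss at each time step, Ln = (yn - sn)^2
L   = L_n.sum()                       # total loss L = L1 + L2 + L3 + L4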


RNN: Back Propagation Through Time (BPTT)
Loss at each time step
L1 L2 L3 L4        Gradient calculation wrt Wsh:
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Back Propagation Through Time (BPTT)
Loss at each time step
L1 L2 L3 L4
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Gradient calculation wrt Wsh:
∂L4/∂Wsh = ∂L4/∂s4 · ∂s4/∂Wsh

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step
L1 L2 L3 L4
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Gradient calculation wrt Wsh:
∂L4/∂Wsh = ∂L4/∂s4 · ∂s4/∂Wsh = −2(y4 − s4) · h4

Weight updation wrt Wsh:
Wsh = Wsh − η · ∂L4/∂Wsh   (η = learning rate)

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Whh:
∂L4/∂Whh = ∂L4/∂s4 · ∂s4/∂h4 · ∂h4/∂Whh

Now, h4 = f(z4), where z4 = Whh·h3 + Wxh·x4
Simply, ∂h4/∂Whh = ∂h4/∂z4 · ∂z4/∂Whh
Then, ∂z4/∂Whh = h3 + Whh·∂h3/∂Whh, and ∂h3/∂Whh = ∂h3/∂z3 · (h2 + Whh·∂h2/∂Whh), and so on back to h1

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Whh:
Expanding the recursion, ∂L4/∂Whh collects contributions from h3, h2, h1 and h0:
∂L4/∂Whh = ∂L4/∂s4 · ∂s4/∂h4 · Σ(k=1…4) ∂h4/∂hk · ∂hk/∂Whh

Weight updation wrt Whh:
Whh = Whh − η · ∂L4/∂Whh

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Wxh:
∂L4/∂Wxh = ∂L4/∂s4 · ∂s4/∂h4 · ∂h4/∂Wxh

Now, h4 = f(z4), where z4 = Whh·h3 + Wxh·x4
Simply, ∂h4/∂Wxh = ∂h4/∂z4 · ∂z4/∂Wxh
Then, ∂z4/∂Wxh = x4 + Whh·∂h3/∂Wxh, and the recursion continues through z3, z2, z1, since every earlier state also depends on Wxh

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Wxh:
∂L4/∂Wxh = ∂L4/∂s4 · ∂s4/∂h4 · ∂h4/∂z4 · (x4 + Whh·∂h3/∂Wxh + …)

Weight updation wrt Wxh:
Wxh = Wxh − η · ∂L4/∂Wxh

In Keras, a simple RNN layer is created with:
tf.keras.layers.SimpleRNN(rnn_units)

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
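A minimal training sketch built around this layer (the data shapes, unit count and optimizer are illustrative assumptions); Keras unrolls the network and carries out BPTT automatically during fit:

import tensorflow as tf

rnn_units = 64
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),                 # sequences with 8 features per step
    tf.keras.layers.SimpleRNN(rnn_units),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')          # gradients computed by BPTT

x = tf.random.normal((32, 10, 8))                    # 32 sequences of 10 time steps
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=2, verbose=0)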
Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:
Exploding Gradient

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:
Exploding Gradient Vanishing Gradient

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:
Exploding Gradient Vanishing Gradient

 make learning unstable

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:

Exploding Gradient:
 Makes learning unstable

Vanishing Gradient:
 Short-term dependencies: “the stars shine in the ?” → sky (an RNN works well here)
 Long-term dependencies: “I grew up in Spain…........…………………… I speak fluent Spanish.” (difficult for an RNN to remember as the gap increases)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
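The effect can be illustrated numerically: the BPTT gradient contains a long product of factors built from Whh, and repeatedly multiplying such factors either drives the gradient towards zero or blows it up. A rough NumPy sketch (the scale values are arbitrary and the activation-derivative factor is omitted for simplicity):

import numpy as np

def gradient_factor_norm(scale, steps=50, size=8, seed=0):
    rng = np.random.default_rng(seed)
    W_hh = scale * rng.normal(size=(size, size)) / np.sqrt(size)
    g = np.eye(size)
    for _ in range(steps):          # product of `steps` repeated Whh factors
        g = W_hh.T @ g
    return np.linalg.norm(g)

print(gradient_factor_norm(0.5))    # shrinks towards 0   -> vanishing gradient
print(gradient_factor_norm(2.0))    # grows very large    -> exploding gradient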
Possible Solutions

Exploding Gradient:
 Gradient clipping

Vanishing Gradient:

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Possible Solutions

Exploding Gradient:
 Gradient clipping

Vanishing Gradient:
 Activation function (ReLU)
 Weight initialization (identity matrix)
 Gated cells (LSTM, GRU, etc.)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
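These remedies map onto standard Keras options; a hedged sketch in which the particular values (clipnorm=1.0, 64 units, and so on) are illustrative assumptions:

import tensorflow as tf

# Exploding gradients: clip the gradient norm (or value) inside the optimizer.
opt = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)    # or clipvalue=...

# Vanishing gradients: e.g. ReLU activation with an identity recurrent initialization.
rnn = tf.keras.layers.SimpleRNN(
    64,
    activation='relu',
    recurrent_initializer='identity',
)

# Gated cells are a further option, e.g. tf.keras.layers.LSTM or tf.keras.layers.GRU.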


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep Learning
Course Code:

Unit 4: Sequential models &


Recurrent Neural Networks (RNNs)

Lecture 3: Long Short-Term


Memory (LSTM) networks

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)
 A special kind of RNN, capable of learning long-term dependencies.
 Introduced by Hochreiter & Schmidhuber (1997).
 Keeping relevant information for long periods of time is their default behavior.
 Have been refined and popularized by many researchers.
 Successfully applied in many problems that have sequential behavior.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Selective Read, Selective Write, Selective Forget
– The Whiteboard Analogy

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problems with RNN

 sn holds information of previous time


s1 s2 s3 sn
steps
Wsh Wsh Wsh Wsh

h1 h2 h3 hn

Wxh Wxh Wxh Wxh


x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problems with RNN

 sn holds information of previous time


s1 s2 s3 sn
steps
Wsh Wsh Wsh Wsh

h1 h2 h3 hn  Information stored at time step n-k (for


some k<n) gets completely morphed
Wxh Wxh Wxh Wxh
x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problems with RNN

 sn holds information of previous time


s1 s2 s3 sn
steps
Wsh Wsh Wsh Wsh

h1 h2 h3 hn  Information stored at time step n-k (for


some k<n) gets completely morphed
Wxh Wxh Wxh Wxh
x1 x2 x3 xn  Similar problem when going backwards
(backpropagation)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

Let us see an analogy for this

Image source:https://prvnk10.medium.com/the-whiteboard-analogy-to-deal-vanishing-and-exploding-gradients-1c0d47bfd6e1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

 Selectively write

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

 Selectively write

 Selectively read

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

 Selectively write

 Selectively read

 Selectively forget

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively write
time.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively write
time.

𝑎𝑐 = 17

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively write
time.

𝑎𝑐 = 17
𝑏𝑑 = 50

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively read
time.

𝑎𝑐 = 17
𝑏𝑑 = 50

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively read
time.

𝑎𝑐 = 17
𝑏𝑑 = 50
𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑐 = 17
𝑏𝑑 = 50
𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑐 = 17

𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑐 = 17
𝑎𝑐(𝑏𝑑 + 𝑎) = 884
𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑑 + 𝑎𝑐(𝑏𝑑 + 𝑎) = 904
𝑎𝑐(𝑏𝑑 + 𝑎) = 884
𝑎𝑑 = 20

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

An RNN has a finite state size.

𝑎𝑑 + 𝑎𝑐(𝑏𝑑 + 𝑎) = 904
𝑎𝑐(𝑏𝑑 + 𝑎) = 884
𝑎𝑑 = 20

Thus, we need selective read, selective write and selective forget!

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Understand the concept using real-time example
+/-

What is the sentiment of the review?

x1 x2 x3 xn
The First ... performance

Review: The first half of the movie was dry but the
second half really picked up pace. The lead actor
delivered an amazing performance.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Understand the concept using real-time example
+/-

What is the sentiment of the review?


 Selectively write
x1 x2 x3 xn
The First ... performance  Selectively read
Review: The first half of the movie was dry but the
second half really picked up pace. The lead actor  Selectively forget
delivered an amazing performance.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Understand the concept using real-time example
+/-

What is the sentiment of the review?


 Selectively write
 Selectively read
 Selectively forget
Helps to store only important information

x1 x2 x3 … xn
The First ... performance
Review: The first half of the movie was dry but the
second half really picked up pace. The lead actor
delivered an amazing performance.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)
 Computational block
 Track information
 Maintain a cell state
 Use gates
[LSTM cell diagram: xn and hn−1 pass through σ and tanh gates to update the cell state Cn−1 → Cn and produce hn and the output sn]

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)

How do LSTMs work?
a) Forget
b) Input
c) Update
d) Output

Source: https://medium.com/analytics-vidhya/tagged/lstm

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)
sn
How do LSTMs work?
a) Forget Cn-1 × +
tanh
b) Input fn in
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Forget gate gets rid of xn


irrelevant information

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Long Short-Term Memory (LSTM)
sn
How do LSTMs work?
a) Forget Cf
Cn-1 × +
tanh
b) Input fn in
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Forget gate gets rid of xn


irrelevant information

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 ×
Cf
+
b) Input in tanh
×
gn ×
c) Update σ σ tanh σ
d) Output hn-1

Input gate stores relevant xn


information from current
input
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 ×
Cf
+
in Ci tanh
b) Input ×
gn ×
c) Update σ σ tanh σ
d) Output hn-1

Input gate stores relevant xn


information from current
input
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 × Cf + Cn
b) Input Ci tanh
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Update gate selectively xn


updates the cell state value

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 × + Cn
tanh
b) Input on
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Output gate returns a xn


filtered version of the
cell state
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 × + Cn
tanh
b) Input on
× ×
c) Update
σ σ tanh σ
d) Output hn-1 hn

Output gate returns a xn


filtered version of the
cell state
Amity Centre for Artificial Intelligence, Amity University, Noida, India
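A compact NumPy sketch of one LSTM cell step, consistent with the gate structure shown above; the weight shapes, the concatenated [hn−1; xn] input and the variable names are illustrative assumptions rather than a fixed specification:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W: dict of weight matrices acting on [h_prev; x], b: dict of biases (illustrative names)
    z = np.concatenate([h_prev, x])
    f = sigmoid(W['f'] @ z + b['f'])          # forget gate: what to erase from c_prev
    i = sigmoid(W['i'] @ z + b['i'])          # input gate: what to write
    g = np.tanh(W['g'] @ z + b['g'])          # candidate values
    c = f * c_prev + i * g                    # update: new cell state Cn
    o = sigmoid(W['o'] @ z + b['o'])          # output gate: filtered view of the cell
    h = o * np.tanh(c)                        # hidden state hn
    return h, c

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                               # illustrative sizes
W = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_x)) for k in 'figo'}
b = {k: np.zeros(n_h) for k in 'figo'}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):           # run the cell over a short sequence
    h, c = lstm_step(x, h, c, W, b)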
LSTM Gradient Flow
s1 s2 s3

× + × + × +
tanh tanh tanh C3
C0 fn fn fn
× × × × × ×
σ σ tanh σ σ σ tanh σ σ σ tanh σ

x1 x2 x3

Uninterrupted gradient flow

Amity Centre for Artificial Intelligence, Amity University, Noida, India


LSTM Gradient Flow

BPTT in LSTM is similar to BPTT in RNN.


The complexity of the derivatives increases due to the presence of the gates.
Detailed information on BPTT of LSTM can be found at
https://kartik2112.medium.com/lstm-back-propagation-behind-the-scenes-andrew-
ng-style-notations-7207b8606cb2
tf.keras.layers.LSTM(lstm_units)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
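For example, a many-to-one LSTM model for the movie-review sentiment task sketched earlier; the vocabulary size, embedding width and unit count are illustrative assumptions:

import tensorflow as tf

lstm_units = 128                                                  # illustrative
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                                # integer word indices, any length
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),    # assumed 10k-word vocabulary
    tf.keras.layers.LSTM(lstm_units),
    tf.keras.layers.Dense(1, activation='sigmoid'),               # e.g. positive/negative review
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])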


Amity Centre for Artificial Intelligence, Amity University, Noida, India
