4. Update weights: W ← W − η ∂J(W)/∂W    (in code: weights.assign(weights - lr * grads))
5. Return weights
• The amount that the weights are updated during training is referred to as the step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model is adapted to the problem.
• The key step is computing that gradient, which is not easy at all. The question is: given the loss and all of the weights in our network, how do we know which way is a good direction to move each weight? That process is called back-propagation. We will discuss back-propagation using elementary calculus.
How does a small change in one weight (ex. W2) affect the final loss J (W)?
• This is a simple network with one input layer, one hidden layer (a single hidden neuron) and one output layer: the simplest neural network you can create.
• Computing the gradient of the loss J(W) with respect to w2 (the weight between the hidden unit and the output) tells us how sensitive the loss is to that weight. We actually want to see: how does a small change in one weight (e.g. w2) affect the final loss J(W)?
• This derivative tells us how much a small change in this weight will affect our loss: if we nudge the weight in one direction, will the loss increase or decrease, and by how much?
The gradient of the loss with respect to w2 is ∂J(W)/∂w2.
To compute it, we apply the chain rule backwards from the loss function through the output. That is the gradient we care about: the gradient of our loss with respect to w2.
• We can decompose this derivative into two components using the chain rule from elementary calculus:
∂J(W)/∂w2 = ∂J(W)/∂ŝ × ∂ŝ/∂w2
• That is, the derivative of the loss with respect to the output ŝ multiplied by the derivative of the output with respect to w2; this is just a standard use of the chain rule applied to the original derivative on the left-hand side.
Computing Gradients: Backpropagation
(Diagram: input x1 → hidden unit z1 → output ŝ → loss J(W), with weights w1 and w2.)
In the last component of the chain rule (shown in red), we must once again recursively apply the chain rule, because that is another derivative we cannot evaluate directly. We expand it with one more application of the chain rule, and then all of the components can be evaluated.
We can propagate these gradients through the hidden units of the neural network all the way back to the weight we are interested in. In this example we first computed the derivative with respect to w2, then we back-propagated and used that information for w1 as well. That is why it is called back-propagation: the process runs from the output all the way back to the input.
We repeat this process many times over the course of training, back-propagating these gradients over and over again through the network, from the output to the inputs, to determine for every single weight how much a small change in that weight affects the loss function, whether it increases or decreases it, and how we can use that to reduce the loss, which is our final goal. That is the back-propagation algorithm, the core of training neural networks.
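As a minimal sketch (an assumption, not code from the slides) of how this looks in TensorFlow for a tiny one-hidden-neuron network like the one in the diagram, where the tape applies the chain rule from the loss back to w1 and w2 and the update is W ← W − η ∂J/∂W:

```python
import tensorflow as tf

# Toy network from the diagram: z1 = w1*x, s = w2*z1, loss J = (s - y)^2
w1 = tf.Variable(0.5)
w2 = tf.Variable(-0.3)
x, y = tf.constant(1.0), tf.constant(2.0)
lr = 0.1  # learning rate (eta)

with tf.GradientTape() as tape:
    z1 = w1 * x
    s = w2 * z1
    loss = (s - y) ** 2

# Back-propagation: chain rule from the loss through s and z1 back to each weight
dw1, dw2 = tape.gradient(loss, [w1, w2])

# Gradient-descent update: W <- W - lr * dJ/dW
w1.assign(w1 - lr * dw1)
w2.assign(w2 - lr * dw2)
```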
Covariate Shift
Why Normalization
Bias Variance Tradeoff
Bias and variance typically trade off in relation to model complexity
To minimize the total error of the model using the bias-variance tradeoff:
the best fit is given by the hypothesis at the tradeoff point.
One possible strategy: Take the average of the last several values.
This might work in certain cases but it is not very suitable for scenarios
when a parameter is more dependent on the most recent values.
It is based on the assumption that more recent values of a variable contribute more to the formation of the next value than preceding values.
Using the famous second remarkable limit lim_{x→0}(1 − x)^{1/x} = 1/e and making the substitution β = 1 − x:
• Taking β = 0.9 indicates that in approximately t = 1/(1 − β) = 10 iterations the weight of an observation decays to 1/e compared to the weight of the current observation.
• In other words, the exponentially weighted average mostly depends only on the last t = 10 observations.
As in the equation for the exponential moving average, every observation value is multiplied by a term βᵗ; comparing both forms gives this result.
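A minimal sketch (an assumption, not from the slides) of the exponentially weighted average vₜ = β·vₜ₋₁ + (1 − β)·xₜ, illustrating that with β = 0.9 an observation's weight decays to roughly 1/e after about 1/(1 − β) = 10 steps:

```python
def exp_moving_average(values, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t"""
    v, out = 0.0, []
    for x in values:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out

print(0.9 ** 10)  # ~0.35, close to 1/e: the weight of a 10-step-old observation
```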
• In this example, the starting point and the local minimum have different horizontal coordinates but almost equal vertical coordinates.
• Using gradient descent to find the local minimum will likely make the loss function slowly oscillate along the vertical axis.
• These bounces occur because gradient descent does not store any history of its previous gradients, which makes the gradient steps more erratic on each iteration. With a large learning rate this can even lead to divergence.
Setting the Learning Rate
Gradient Descent
Momentum
It would be desirable to have the loss function take larger steps in the horizontal direction and smaller steps in the vertical direction.
Momentum uses a pair of equations at each iteration:
• an exponentially moving average of the gradient values dw;
• the normal gradient-descent update, using the computed moving-average value on the current iteration.
The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
(An overview of gradient descent optimization algorithms, Sebastian Ruder)
Momentum
Instead of simply using the raw gradients to update the weights, we take several past values and literally perform the update in the averaged direction.
Momentum usually converges much faster than gradient descent. With Momentum, there are also fewer risks in using larger learning rates, thus accelerating the training process.
Optimization with Momentum: it is recommended to choose β close to 0.9. (In the figure, the velocity v is initialised to 0 and the arrows show the projected gradient.)
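A minimal sketch (an assumption) of the pair of momentum equations: an exponentially moving average of the gradient, followed by the usual descent step in the averaged direction:

```python
def momentum_step(w, dw, v, lr=0.01, beta=0.9):
    """Momentum update: v is the EMA of gradients (initialised to 0), w the weights."""
    v = beta * v + (1 - beta) * dw   # exponentially moving average of gradient values
    w = w - lr * v                   # gradient-descent update in the averaged direction
    return w, v
```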
The intuition is that the standard momentum method first computes the
gradient at the current location and then takes a big jump in the direction of the
updated accumulated gradient. In contrast Nesterov momentum first makes a
big jump in the direction of the previous accumulated gradient and then
measures the gradient where it ends up and makes a correction. The idea being
that it is better to correct a mistake after you have made it.
AdaGrad (white) vs. gradient descent (cyan) on a terrain with a saddle point. The learning rate of AdaGrad is set higher than that of gradient descent, but the observation that AdaGrad's path is straighter stays largely true regardless of the learning rate. This property allows AdaGrad (and other similar adaptive methods) to handle the saddle-point region more effectively.
AdaGrad (Adaptive Gradient Algorithm)
From the animation, it can
be seen that Adagrad
might converge slower
compared to other
methods. This could be
because the accumulated
gradient in the
denominator causes the
learning rate to shrink and
become very small,
thereby slowing down the
learning over time.
(In the update, the denominator accumulates the squared gradient at every iteration.)
• A small positive aspect of this algorithm is that only a single bit is required to store the sign of each gradient, which can be handy in distributed computations with strict memory requirements.
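A minimal sketch (an assumption) of the AdaGrad update described above: the squared gradients accumulate in the denominator, so the effective learning rate keeps shrinking:

```python
import numpy as np

def adagrad_step(w, dw, g_acc, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients and scale the step per parameter."""
    g_acc = g_acc + dw ** 2                   # cumulative sum of squared gradients
    w = w - lr * dw / (np.sqrt(g_acc) + eps)  # eps avoids division by zero
    return w, g_acc
```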
RMSProp (Root Mean Square Propagation)
RMSProp was developed as an improvement over AdaGrad that tackles the issue of learning-rate decay.
• Instead of storing a cumulative sum of squared gradients dw² in vₜ, an exponentially moving average of the squared gradients dw² is calculated.
RMSProp (green) vs AdaGrad (white). The first run just shows the balls; the second run also shows the
sum of gradient squared represented by the squares.
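A minimal sketch (an assumption) of the RMSProp update, which swaps AdaGrad's cumulative sum for an exponentially moving average of the squared gradients:

```python
import numpy as np

def rmsprop_step(w, dw, v, lr=0.001, beta=0.9, eps=1e-8):
    """RMSProp: EMA of squared gradients instead of a cumulative sum."""
    v = beta * v + (1 - beta) * dw ** 2
    w = w - lr * dw / (np.sqrt(v) + eps)
    return w, v
```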
Adam (Adaptive Moment Estimation)
• Adam is the most famous optimization algorithm in deep learning.
• Adam combines Momentum and RMSProp algorithms. To achieve it, it simply keeps
track of the exponentially moving averages for computed gradients and squared
gradients respectively.
• Furthermore, it is possible to use bias correction for moving averages for a more
precise approximation of gradient trend during the first several iterations.
• The experiments show that Adam adapts well to almost any type of neural network
architecture taking the advantages of both Momentum and RMSProp.
(The update equations track the first moment, the second moment, and the resulting updated weight.)
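A minimal sketch (an assumption, following the standard Adam formulation) combining the two exponentially moving averages with bias correction:

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: EMA of gradients (m) and squared gradients (v), bias-corrected at step t >= 1."""
    m = beta1 * m + (1 - beta1) * dw          # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * dw ** 2     # second moment (RMSProp part)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the first iterations
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```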
Adam (Adaptive Moment Estimation)
Disadvantage: it does not focus on the data points; it focuses on computation time instead.
Note: the optimization algorithm can therefore be picked depending on the requirements and the type of data.
Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Summary- Optimizers
Remember:
Optimization through gradient descent
W ← W − η ∂J(W)/∂W
(Figure: the loss surface J(W) with an initial guess for W.)
Setting the Learning Rate
• Large learning rates overshoot, become unstable and diverge, which is undesirable.
Setting the Learning Rate
• Setting the learning rate is very challenging.
• Stable learning rates converge smoothly and avoid local minima.
Setting the Learning Rate
Idea 1:
Trial and error: try different learning rates and see what works well.
Idea 2:
Do something smarter!
Design an adaptive learning rate that "adapts" to the landscape.
(Annotations on the update equations: the learning rate need not be a single constant; adaptive methods use different learning rates at each iteration, with a small positive ε added to avoid division by 0.)
v(w, t+1) = γ · v(w, t) + (1 − γ) · ∂J(ω)/∂ω
where γ is the momentum or forgetting factor, usually 0.9.
RMSProp (Root Mean Square Propagation)
• Advantage: it avoids the monotonic decrease of the learning rate that occurs in AdaGrad.
• mₜ and vₜ are initialized as 0, so they tend to be 'biased towards 0', especially since both β1 and β2 ≈ 1. Adam fixes this problem by computing 'bias-corrected' mₜ and vₜ. This controls the weights while reaching the global minimum and prevents high oscillations near it.
• The algorithm has a faster running time, low memory requirements, and requires less tuning than other optimization algorithms.
Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Deep Learning
• Course Code:
• Unit 2: Model parameter optimization
• Lecture 3: Overfitting and underfitting; bias-variance tradeoff
Source: https://www.javatpoint.com/overfitting-in-machine-learning
• Techniques to Avoid Overfitting
• Data Augmentation
• Regularization
• Drop-out
• Early-stopping
• Cross validation
Source: https://www.javatpoint.com/overfitting-in-machine-learning
Slide source: Coding Lane
Regularization
Cost function = (sum of the squared residuals) + λ × (slope)²
• For the linear regression line, consider two points that lie on the line (cost value in the figure: 1.96).
• For the ridge regression line, assume λ = 1 and slope = 0.7 (cost value in the figure: 0.63).
Here:
• the first term is the sum of the squared residuals,
• the second term is the penalty for the errors,
• and slope is the slope of the curve/line.
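A minimal numerical sketch (an assumption, with made-up points) of the ridge cost = sum of squared residuals + λ × slope²:

```python
def ridge_cost(points, slope, intercept, lam):
    """Ridge regression cost: squared residuals plus a penalty on the slope."""
    ssr = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return ssr + lam * slope ** 2

points = [(1.0, 1.0), (2.0, 1.7)]   # hypothetical data points
print(ridge_cost(points, slope=0.7, intercept=0.3, lam=1.0))
```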
Early Stopping
• Stop training before we have a chance to overfit.
• The number of iterations (epochs) is a hyperparameter.
• Too few epochs => suboptimal solution (underfit).
• Too many epochs => overfitting.
(Figure: training and testing loss versus training iterations. The training loss keeps decreasing, but the testing loss eventually starts to rise; stop training at that point. The region to the left is under-fitting, the region to the right is over-fitting.)
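A minimal sketch (an assumption, using the standard Keras callback) of stopping when the validation loss stops improving:

```python
import tensorflow as tf

# Stop if the validation loss has not improved for 5 epochs; keep the best weights seen
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```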
https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
If we stabilize the input values to each layer (defined as z = Wx + b, where z is the linear transformation by the weights W and the biases b), we can prevent our activation function from pushing the input values into the saturated max/min regions of the activation function.
Fig.: From the gradient it can be observed that for larger z the gradient of the function approaches zero. When the network's nodes operate in this region, training slows down significantly, since the gradient values decrease.
(In the batch-normalization formulas, the mean and variance are computed over the number of neurons at layer h.)
• Normalize the hidden activations by subtracting the mean from each input and dividing by the standard deviation plus the smoothing term (ε).
• γ (gamma) and β (beta) are learnable parameters used for re-scaling (γ) and shifting (β) the vector containing the values from the previous operations.
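A minimal sketch (an assumption, using the common formulation that divides by the square root of the variance plus ε) of the normalize / re-scale / shift computation:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalize activations over the batch, then re-scale (gamma) and shift (beta)."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)   # eps is the smoothing term
    return gamma * z_hat + beta
```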
• For a grayscale images, the pixel value is a single number that represents the brightness of the pixel. The most common pixel
format is the byte image, where this number is stored as an 8-bit integer giving a range of possible values from 0 to 255.
• Similarly for color images, each level is represented by the range of decimal numbers from 0 to 255 (256 levels for each
color), equivalent to the range of binary numbers from 00000000 to 11111111, or hexadecimal 00 to FF. The total number of
available colors is 256 x 256 x 256, or 16,777,216 possible colors.
(Example classifier output: four classes with predicted probabilities 86%, 7%, 5.8%, and 1.2%.)
Classification
(Figure: input image → pixel representation → classification model → output "CAR"; local features such as edges and corners are extracted along the way.)
Step 1:
Extract Set of Local Features by
applying filters (set of weights)
Step 2:
Apply Multiple Filters for
extraction of different features
Step 3:
Spatial Sharing of parameters
for each filter
Input Image (2-D array of pixels) → Convolutional Neural Network → X or O
(The network should output X for an image of an X and O for an image of an O.)
Challenging Cases
Rotation, weighting, translation and scaling: the network should still classify the X and the O correctly in these cases.
Human Interpretation: a person recognizes both images as an X, even though their pixel patterns differ slightly.
Computer Interpretation: each image is a 9×9 grid of values (+1 and −1). Pixel-wise matching of the two grids fails, because many individual pixel positions differ even though both images show an X.
(Figure: the two 9×9 pixel grids, with the mismatched positions marked, and the resulting decision.)
Piece Matching of Features
Features: small 3×3 patterns extracted from the X, for example the crossing pattern
 1 -1  1
-1  1 -1
 1 -1  1
and the diagonal pattern
 1 -1 -1
-1  1 -1
-1 -1  1
Piece matching: instead of comparing whole images pixel by pixel, each 3×3 feature is slid across the 9×9 image and compared against every 3×3 patch. Wherever the feature lines up with part of the X (along its arms or at its centre), the match is strong, even if the image as a whole is shifted or distorted.
(Figure: the 3×3 features being matched, patch by patch, against the 9×9 X image.)
Identity Kernel
Blur
Left Sobel, Right Sobel, Top Sobel, Bottom Sobel: Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction.
Emboss
Outline
Sharpen: the sharpen kernel emphasizes differences in adjacent pixel values, which makes the image look more vivid.
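A minimal sketch (an assumption, using SciPy) of applying one of these kernels, the sharpen kernel, to a grayscale image array:

```python
import numpy as np
from scipy.signal import convolve2d

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

image = np.random.rand(8, 8)   # stand-in for a grayscale image
out = convolve2d(image, sharpen, mode="same", boundary="symm")
```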
Dimensions of a layer: H × W × D (e.g. 32 × 32 × 3).
H (height) and W (width) are spatial dimensions, whereas D (depth) is the number of filters.
Stride = step size of the filter. Receptive field = the location (region) of the input image that a unit is connected to.
g(x) = max(0, x)
(Figure: two 7×7 feature maps, shown before and after ReLU; every negative value becomes 0, while positive values pass through unchanged.)
Subsampling
(Figure: an example subsampling computation on the feature map.)
• Zero padding: pad the input with filter_size/2 zeros on each side (integer division); the output dimension is then the same as the input for stride s = 1.
6 x 6 image
Each filter detects a small pattern (3 x 3).
Filter 1 (3×3):
 1 -1 -1
-1  1 -1
-1 -1  1
6×6 input image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
With stride = 1, the filter slides one pixel at a time; the dot product at each position gives a 4×4 feature map:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
With stride = 2, the filter moves two pixels at a time, so the first row of outputs is 3, -3.
Filter 2 (3×3), stride = 1:
-1  1 -1
-1  1 -1
-1  1 -1
Repeating the same process for Filter 2 gives a second 4×4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
The two 3×3 kernels therefore produce a 4×4×2 output (two stacked feature maps).
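A minimal sketch (an assumption, using SciPy cross-correlation, which matches the sliding dot product on the slide) that reproduces the 4×4 feature map for Filter 1 with stride 1:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

# 'valid' keeps only positions where the 3x3 filter fits fully -> 4x4 feature map
print(correlate2d(image, filter1, mode="valid"))   # first row: 3 -1 -3 -1
```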
Convolution: the same 6×6 image is convolved with both 3×3 filters (Filter 1 and Filter 2).
Fully-connected alternative: flattening the 6×6 image into 36 inputs x1 … x36 and connecting every input to every unit of the next layer would require far more parameters than the convolutional approach.
Flattening the 6×6 image into 36 numbered inputs: the unit that computes the first Filter 1 response (output 3) connects only to the 9 inputs covered by the 3×3 filter (inputs 1, 2, 3, 7, 8, 9, 13, 14, 15, with values 1, 0, 0, 0, 1, 0, 0, 0, 1), not to all 36 inputs: fewer parameters!
The next unit (output −1) connects to the shifted window of 9 inputs (2, 3, 4, 8, 9, 10, 14, 15, 16) and uses the same filter weights: fewer parameters, shared weights.
Click the link below or copy paste the URL in your browser
https://poloclub.github.io/cnn-explainer/
With the CNN Explainer you can learn about and explore a Convolutional Neural Network (CNN) in your browser, with a real sample image dataset.
(Figure: the 9×9 X image passes through Conv. Layer → Activation (ReLU) → Conv. Layer → Activation (ReLU) → Pooling (Max-Pooling) → Conv. Layer → Activation (ReLU), and is finally classified as X or O.)
Fully Connected Layer (Training Phase)
During training, the flattened feature values produced by the convolution/pooling stages are connected to the output units for the classes X and O, and the weights of these connections are learned.
Fully Connected Layer (Testing Phase)
At test time the flattened feature values (e.g. 0.9, 0.65, 0.96, 0.73, 0.23, 0.63, 0.44, 0.89, 0.94, 0.53, 0.45, 0.87, …) are weighted and summed for each class: in this example the X output receives a strong vote (≈ 0.912) and the O output a weak one (≈ 0.517), so the image is classified as an X.
(Figure: the 9×9 X image is processed by the convolutional and pooling layers, and the resulting feature vector is passed through Fully Connected Layer 1 → Layer 2 → Layer 3 to produce the X and O outputs.)
Classification
(Figure: input image → network → outputs for Class A, Class B, Class C, Class D.)
• Convolutional and pooling layers help to extract high-level features of the input.
• The fully connected layers use the extracted high-level features to classify the input image into the different classes.
• The output also includes the class probability for the image.
Stride is the number of pixel shifts over the input matrix.
left image: stride =0, middle image: stride = 1, right image: stride =2
If an image is 100×100, a filter is 6×6, the padding is 7, and the stride is 4, the result of convolution will be
(100 – 6 + (2)(7)) / 4 + 1 = 28×28.
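A minimal sketch (an assumption) of the output-size formula (W − F + 2P)/S + 1 used in the example above:

```python
def conv_output_size(w, f, p, s):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

print(conv_output_size(100, 6, 7, 4))  # 28, as in the example
```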
1-D convolution example (Padding = Same):
Time-series signal: 0 1 0 0 0 0 -1 0 0 0 0
Kernel: 1 2 .5 (the kernel is inverted/flipped before sliding, as in true convolution)
Result: 0 .5 2 1 0 0 -.5 -2 -1 0 0
(The figure steps the inverted kernel across the padded signal, position by position, building up the result row shown above.)
ReLU activation applied to the convolution result:
Input: 0 .5 2 1 0 0 -.5 -2 -1 0 0
Result: 0 .5 2 1 0 0 0 0 0 0 0 (all negative values are set to 0)
Max pooling: a window slides over the ReLU result and keeps only the maximum value in each window, producing a smaller, subsampled output.
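A minimal sketch (an assumption; the exact numbers depend on the padding and alignment convention) of the 1-D convolution → ReLU → max-pooling pipeline walked through above:

```python
import numpy as np

signal = np.array([0, 1, 0, 0, 0, 0, -1, 0, 0, 0, 0], dtype=float)
kernel = np.array([1, 2, 0.5])

conv = np.convolve(signal, kernel, mode="same")   # 'same' padding keeps the length
relu = np.maximum(conv, 0)                        # ReLU zeroes the negative values
pooled = relu[: len(relu) // 2 * 2].reshape(-1, 2).max(axis=1)  # max-pool, window 2, stride 2
print(conv, relu, pooled, sep="\n")
```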
Algorithms that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2010-2017. The top-5 error refers to the probability that all top-5 classifications proposed by the algorithm for the image are wrong. The algorithms shown in blue are convolutional neural networks. Although VGGNet took second place in 2014, it is widely used in studies because of its concise structure.
AlexNet
AlexNet is a pioneering convolutional neural network (CNN) used primarily for image recognition and classification tasks.
It won the ImageNet Large Scale Visual Recognition Challenge in 2012, marking a breakthrough in deep learning. AlexNet's architecture, with its innovative use of convolutional layers and rectified linear units (ReLU), laid the foundation for modern deep learning models, advancing computer vision and pattern recognition applications.
AlexNet won the ImageNet large-scale visual recognition challenge in 2012. The model was proposed in 2012 in the research paper "ImageNet Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky and his colleagues.
1. AlexNet has eight layers with learnable parameters.
2. The model consists of five convolutional layers, combined with max pooling, followed by 3 fully connected layers; ReLU activation is used in each of these layers except the output layer.
3. The authors found that using the ReLU activation function accelerated training by almost six times.
4. They also used dropout layers, which prevented their model from overfitting. The model is trained on the ImageNet dataset.
5. The total number of parameters in this architecture is 62.3 million.
(6×6×256 = 9216 values are flattened into the first fully connected layer.)
where (W,H) are the width and height of the feature map. The
only difference between Inter and Intra Channel LRN is the
neighborhood for normalization. In Intra-channel LRN, a 2D
neighborhood is defined (as opposed to the 1D neighborhood
in Inter-Channel) around the pixel under-consideration.
Conv = 3×3 filters, s = 1, same padding; Max pool = 2×2, s = 2 (5 max-pooling layers).
ReLU activation in all hidden units; Softmax activation in the output units.
Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition,
the creators of the model cropped out the center 224×224 patch in each image to keep the
input size of the image consistent.
Convolutional Layers:
VGG’s convolutional layers leverage a minimal receptive field, i.e., 3×3, the smallest possible
size that still captures up/down and left/right. Moreover, there are also 1×1 convolution filters
acting as a linear transformation of the input. This is followed by a ReLU unit, which is a huge
innovation from AlexNet that reduces training time.
The convolution stride is fixed at 1 pixel to keep the spatial resolution preserved after
convolution (stride is the number of pixel shifts over the input matrix).
Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually
leverage Local Response Normalization (LRN) as it increases memory consumption and
training time. Moreover, it makes no improvements to overall accuracy.
Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the three
layers, the first two have 4096 channels each, & the third has 1000 channels, 1 for each class.
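A minimal sketch (an assumption, in Keras) of a VGG-style block as described above: stacked 3×3 convolutions with stride 1 and 'same' padding, followed by 2×2 max pooling:

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(num_convs, filters):
    """Stack of 3x3 ReLU convolutions (stride 1, same padding) + 2x2 max pooling."""
    block = tf.keras.Sequential()
    for _ in range(num_convs):
        block.add(layers.Conv2D(filters, kernel_size=3, strides=1,
                                padding="same", activation="relu"))
    block.add(layers.MaxPooling2D(pool_size=2, strides=2))
    return block
```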
Effective Receptive Field: two stacked 3×3 convolutions have the same 5×5 effective receptive field as a single 5×5 convolution.
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
"Going deeper with convolutions." In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1-9. 2015.
9 Inception Modules, a Final Classifier, and Auxiliary Classifiers.
Auxiliary classifiers:
• Intermediate softmax branches at the middle of the network
• Only used during training
• Purpose: combating the vanishing gradient problem, and regularization
• Their loss is added to the total loss with weight 0.3
• Structure: 5×5 average pooling (stride 3) → 1×1 conv (128 filters) → 1024 FC → 1000 FC → Softmax
What’s Novel in GoogleNet?
Inception module,
1x1 convolutions,
Global average pooling,
Auxiliary classifiers,
Increased network depth (22 layers).
• The vanishing gradient problem mainly occurs with the sigmoid and tanh activation functions.
"The gradients become very small for the earlier layers, meaning there is no major difference between the new weight and the old weight."
• Conversely, when the gradients grow very large as they are propagated back, the weight updates blow up; this is known as the exploding gradients problem.
• Proposed by Shaoqing Ren, Kaiming He, Jian Sun, and Xiangyu Zhang
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016. https://arxiv.org/abs/1512.03385
Residual Connection
High Accuracy
Bottleneck layers
Because of these dense connections, the model requires fewer layers, as there is no need to learn redundant feature
maps, allowing the collective knowledge (features learned collectively by the network) to be reused.
• Skip Connections: ResNet uses skip connections to implement identity mappings, allowing gradients to
flow through the network without attenuation. DenseNet, on the other hand, uses dense connections,
concatenating feature maps from all preceding layers.
• Memory Usage: DenseNets generally require more memory than ResNets due to the concatenation of
feature maps from all layers. This can be a limiting factor in certain applications.
• Parameter Efficiency: DenseNet is often more parameter-efficient than ResNet. It reuses features
throughout the network, reducing the need to learn redundant feature maps.
• Training Dynamics: DenseNets might have a smoother training process due to the continuous feature
propagation throughout the network. However, this can also lead to increased training time and
computational costs.
• Performance: Both architectures have shown exceptional performance in various tasks. ResNet is often
preferred for very deep networks due to its simplicity and lower computational requirements. DenseNet
shines in scenarios where feature reuse is critical and can afford the additional computational cost.
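A minimal sketch (an assumption, in Keras) of the skip connection ResNet uses: the block's input is added back to its output so that gradients can flow through the identity path:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual block; assumes x already has `filters` channels."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])        # skip connection: output = F(x) + x
    return layers.Activation("relu")(y)
```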
LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document
recognition." Proc. IEEE 86, no. 11 (1998): 2278-2324.
LeNet-5
• Input: 32×32×1 greyscale images
• C1: (32 − 5)/1 + 1 = 27 + 1 = 28
• S2: (28 − 2)/2 + 1 = 13 + 1 = 14
• C3: (14 − 5)/1 + 1 = 9 + 1 = 10
• S4: (10 − 2)/2 + 1 = 4 + 1 = 5
• C5: (5 − 5)/1 + 1 = 1
• The only knowledge we are reusing from the base model is the feature extraction layers. We need to
add additional layers on top of them to predict the specialized tasks of the model. These are
generally the final output layers.
5. Train the new layers
• The pre-trained model’s final output will most likely differ from
the output we want for our model. For example, pre-trained
models trained on the ImageNet dataset will output 1000
classes.
• However, we need our model to work for two classes. In this
case, we have to train the model with a new output layer in
place.
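A minimal sketch (an assumption, using a Keras ImageNet model purely as an illustration) of freezing the reused feature-extraction layers and training a new two-class output layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                          input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                       # freeze the reused feature-extraction layers

model = tf.keras.Sequential([
    base,
    layers.Dense(2, activation="softmax"),   # new output layer for our two classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```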
Image Source: https://www.brainhq.com/brain-resources/brain-connection
Understanding Receptive field
Illustrating the total receptive field and total stride attributes for the L’th layer, which could be seen as the projected
receptive field and stride with respect to the input layer. Together, they capture the overlapping degree of a network.
The green and the orange one. Which one would you like to
have in your architecture?
Image Source: https://developer.nvidia.com/blog/image-segmentation-using-digits-5/
Why do we care for Receptive Field?
Effective kernel size of a dilated convolution: k_eff = k + (k − 1)(d − 1), where k is the kernel size and d the dilation rate.
• Parameter sharing refers to using the same parameter for more than one function in a model: the kernel is reused (by sliding) when calculating the layer output, so there are fewer weights to store and train.
• 2-D convolution: computed along 2 directions (x, y); input = (W × H × c), d filters of size (k × k × c), output = (W1 × H1 × d). E.g. image data (grayscale or color).
• 1-D convolution: computed along 1 direction (time); input = (time-steps × c).
2-D CNN: input shape = 3D (e.g. height = 5, width = 7, feature maps/channels = 1).
3-D CNN: input shape = 4D (e.g. height = 6, width = 6, feature maps/channels = depth = 1).
Depth of the feature map = number of filters. With 1 input channel, each filter is applied as (k × k × 1).
left image: stride =0, middle image: stride = 1, right image: stride =2
Padding affects the output image size when filtering in the convolution layer (assumption: stride = 1).
Convolution with Multiple Channels and Multiple Filters
(With 2 filters, the output feature map has depth 2.)
Example network: INPUT 28×28×1 → Conv1 (8 channels) → Max-Pool → Conv2 (16 channels) → Max-Pool → 64 units → 10 units.
• Conv1: a 3×3 filter for 1 channel, 8 such filters and 8 biases → parameters = 3×3×1×8 + 8 = 80; output = 26×26×8.
• Max-Pool: output = 13×13×8.
• Conv2: output = 11×11×16; parameters = 3×3×8×16 + 16 = 1168.
• Max-Pool: output = 5×5×16.
• FC2: (64 + 1)×10 = 650 parameters.
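A minimal sketch (an assumption, in Keras) of this small network; model.summary() reports the per-layer parameter counts (80 for the first convolution, 650 for the final dense layer):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, activation="relu"),    # 3*3*1*8 + 8 = 80 params, output 26x26x8
    layers.MaxPooling2D(2),                    # 13x13x8
    layers.Conv2D(16, 3, activation="relu"),   # output 11x11x16
    layers.MaxPooling2D(2),                    # 5x5x16
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),    # (64+1)*10 = 650 params
])
model.summary()
```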
Sequence-learning applications:
• Machine translation: "You are my best friend" → "Você é meu melhor amigo"
• Music generation
• DNA sequence analysis
• Video activity recognition (e.g. recognising the class "Fighting")
(In contrast, a standard classifier produces a fixed output size, e.g. the classes Dog and Cat.)
(Figure: a standard feed-forward network, with input, hidden, and output layers connected by weights.)
(Figure: the network is applied at each time step, mapping inputs x1, x2, x3, …, xn to outputs s1, s2, s3, …, sn; in general n is the time step, and the same function is used at every step.)
• Replicate the network any number of times
• Ensure parameter sharing
• The number of time steps does not matter
Let's consider one approach: feed the growing prefix (x1), then (x1, x2), then (x1, x2, x3), then (x1, x2, x3, x4), … into a feed-forward network to produce s1, s2, s3, s4, … Will this approach work?
Solution: add a recurrent connection.
(Figure: hidden states h1, h2, h3, h4, …, hn link each time step to the next; xn is the input, sn the output, and hn carries the past memory forward. The unrolled chain can be represented more compactly as a single cell with a recurrent loop.)
• The same function is used at every time step
• Ensures parameter sharing
• Handles the temporal dependency within the sequence
Recurrent Neural Network Architectures
one to one: a vanilla NN (fixed input to fixed output); one to many: e.g. image captioning (one image to a sequence of words).
Source: https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
(Figure, built up step by step: at each time step t the hidden state is h_t = activation(W_hh · h_{t−1} + W_xh · x_t + bias), starting from h0, and the output is s_t = W_sh · h_t. The same weights W_xh, W_hh, W_sh are reused at every time step t = 1 … 4.)
Assumption for the derivation that follows: the loss at each time step is the least-squares function.
RNN: Back Propagation Through Time (BPTT)
A loss Lₜ is computed at each time step (L1, L2, L3, L4) and the total loss is their sum.
Gradient with respect to W_sh: each output sₜ depends on W_sh directly, so ∂L/∂W_sh is obtained by summing ∂Lₜ/∂sₜ · ∂sₜ/∂W_sh over the time steps; the weight is updated as W_sh ← W_sh − η ∂L/∂W_sh.
Gradient with respect to W_hh: the hidden state hₜ depends on W_hh both directly and through all earlier hidden states (h4 depends on h3, which depends on h2, and so on), so the chain rule must be applied recursively back through time, and the factors ∂hₜ/∂hₜ₋₁ multiply together along the way.
Gradient with respect to W_xh: similarly, ∂L4/∂W_xh involves the current input x4 plus contributions propagated back through W_hh and the earlier hidden states; the weight is updated as W_xh ← W_xh − η ∂L/∂W_xh.
(Assumption on the slides: the loss at each time step is the least-squares function.)
In Keras, a simple recurrent layer is created with tf.keras.layers.SimpleRNN(rnn_units).
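A minimal sketch (an assumption) of using the SimpleRNN layer mentioned above inside a small sequence model:

```python
import tensorflow as tf
from tensorflow.keras import layers

rnn_units = 64
model = tf.keras.Sequential([
    layers.Input(shape=(None, 8)),   # (time steps, features); None = any sequence length
    layers.SimpleRNN(rnn_units),     # h_t = tanh(Wxh·x_t + Whh·h_{t-1} + b)
    layers.Dense(1),                 # output computed from the final hidden state
])
model.compile(optimizer="adam", loss="mse")
```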
Limitations of RNN
• The gradient calculation involves many repeated factors of the weights and of the activation-function contributions, so over long sequences the gradients can vanish or explode.
• Exploding gradients can be mitigated with gradient clipping.
Image source: https://prvnk10.medium.com/the-whiteboard-analogy-to-deal-vanishing-and-exploding-gradients-1c0d47bfd6e1
The whiteboard analogy: the RNN's state is like a whiteboard with limited space, so at each step the network must selectively write, selectively read, and selectively forget.
Example: compute ad + ac(bd + a), where the "board" can hold only 3 statements at a time.
• Selectively write: ac = 17 and bd = 50 are written to the board.
• Selectively read: the stored values are read to compute the next quantity, bd + a = 52.
• Selectively forget: values that are no longer needed are erased to make room; ac(bd + a) = 884 is computed and written, then ad = 20, and finally ad + ac(bd + a) is obtained.
This selective write / read / forget behaviour is what the gates of an LSTM implement.
Example: a movie review is fed word by word as x1, x2, x3, …, xn ("The", "First", …, "performance").
Review: "The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance."
To judge the sentiment, the network must keep the important words and discard the rest: it must use gates to produce the output sn.
Source: https://medium.com/analytics-vidhya/tagged/lstm
How do LSTMs work?
(Figure: the LSTM cell receives the previous cell state Cn−1, the previous hidden state hn−1, and the current input.)
a) Forget: a sigmoid gate decides which parts of Cn−1 to erase, giving Cf = Cn−1 × (forget gate).
b) Input: a sigmoid input gate in and a tanh candidate gn decide what new information to write, giving Ci = in × gn.
c) Update: the new cell state is Cn = Cf + Ci.
d) Output: a sigmoid output gate on is combined with tanh(Cn) to produce the new hidden state hn and the output sn.
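A minimal sketch (an assumption, with hypothetical weight matrices acting on the concatenation of hₙ₋₁ and x) of the forget / input / update / output steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    """One LSTM step: forget, input, update, output."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)      # a) forget gate
    i = sigmoid(Wi @ z + bi)      # b) input gate
    g = np.tanh(Wg @ z + bg)      #    candidate values
    c = f * c_prev + i * g        # c) update: Cn = Cf + Ci
    o = sigmoid(Wo @ z + bo)      # d) output gate
    h = o * np.tanh(c)            #    new hidden state
    return h, c
```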
(Figure: three LSTM cells unrolled over inputs x1, x2, x3, passing the cell state from C0 through to C3.)