
Deep learning

Loss Function, Gradient Descent Algorithm, Backpropagation
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Loss Function
• Compares the target and predicted output values to measure how well the neural
network models the training data.
• The aim is to minimize this loss between the predicted and target outputs.
• There are two major types of loss function:
  • Regression loss: MSE (mean squared error), MAE (mean absolute error)
  • Classification loss: binary cross-entropy, categorical cross-entropy
Note (loss function vs. cost function): the loss function is the loss for a single training
example/input, whereas the cost function is the average loss over the entire training dataset.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mean Squared Error (MSE)

• MSE finds the average of the squared differences between
the target and predicted outputs.
• The difference is squared, so it does not matter whether the
predicted value is above or below the target value; however,
values with a large error are penalized more heavily.
• MSE is also a convex function with a clearly defined
global minimum.
• This makes it easier to use gradient descent
optimization to set the weight values.
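For comparison with the MAE snippet on the next slide, a minimal sketch of MSE in Keras (the y_true / y_pred tensors here are made-up values for illustration):

import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0, 4.0])   # hypothetical targets
y_pred = tf.constant([1.1, 1.9, 3.5, 3.0])   # hypothetical predictions

mse = tf.keras.losses.MeanSquaredError()
loss = mse(y_true, y_pred)                   # mean of the squared differences
print(float(loss))                           # (0.01 + 0.01 + 0.25 + 1.0) / 4 = 0.3175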

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mean Absolute Error (MAE)
• MAE finds the average of the absolute
differences between the target and the
predicted outputs.
• MSE is highly sensitive to outliers, which can dramatically
affect the loss because the distance is squared. MAE is
therefore used when the training data has a large number
of outliers, to mitigate this.
mae = tf.keras.losses.MeanAbsoluteError()
mae(y_true, y_pred)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Binary cross-entropy/Log Loss
• Binary cross-entropy compares each
of the predicted probabilities to the
actual class output, which can be
either 0 or 1.
• It then calculates a score that penalizes the probabilities
based on their distance from the expected value, i.e., how
close or far they are from the actual value (predictions close
to the true class incur a low penalty, distant ones a high penalty).
• Advantage: the cost function is differentiable.
• Disadvantage: multiple local minima; not intuitive.
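A hedged illustration in Keras (the label and probability values are made up):

import tensorflow as tf

y_true = tf.constant([0.0, 1.0, 1.0, 0.0])   # actual class outputs (0 or 1)
y_pred = tf.constant([0.1, 0.8, 0.6, 0.3])   # predicted probabilities

bce = tf.keras.losses.BinaryCrossentropy()
loss = bce(y_true, y_pred)                   # confident wrong predictions are penalized most
print(float(loss))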

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Categorical cross-entropy
• Also called softmax loss: a softmax
activation followed by a cross-entropy loss.
• It is used for multi-class classification.
• In the specific (and usual) case of
multi-class classification, the labels are
one-hot encoded.
• Sparse categorical cross-entropy loss:
  • Used when the number of classes is
    very large (e.g., 1000).
  • Avoids one-hot encoding, which
    requires a large amount of memory.
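A minimal sketch of both variants in Keras (the label and probability values are made up); the sparse version takes integer class indices instead of one-hot vectors:

import tensorflow as tf

y_pred = tf.constant([[0.1, 0.8, 0.1], [0.2, 0.3, 0.5]])   # predicted class probabilities

# One-hot labels with categorical cross-entropy
y_true_onehot = tf.constant([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
cce = tf.keras.losses.CategoricalCrossentropy()
print(float(cce(y_true_onehot, y_pred)))

# Integer labels with sparse categorical cross-entropy (no one-hot encoding needed)
y_true_int = tf.constant([1, 2])
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(float(scce(y_true_int, y_pred)))                     # same value as above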

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• We want to find the network weights that achieve the lowest loss; those
weights can then be used for prediction.

W* = argmin (1/n) Σᵢ₌₁ⁿ L(f(xᵢ; W), sᵢ)
        W
W* = argmin J(W)
        W

• Here W is the set of all weights; we need to find the optimal set of weights
that minimizes the average loss J(W) over the entire training set.
• The test set is the separate data set on which we want to evaluate our model.
• argmin (argument of the minimum) picks out the weights W at which the
objective J(W) is smallest.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
w
Remember: our loss function is just a
simple function of those weights.
If we plot the loss landscape for two weights:
• The weights are on the x and y axes, and the
loss is on the z axis.
• For any value of W, we can read off the loss
at that point.
• We need to find the point on this
landscape, i.e., the values of W,
that give the minimum loss.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
w
• Randomly pick a place on this
landscape as a starting point for finding the
minimum.
• From this random starting point, we examine how the
landscape changes, i.e., how its slope changes,
using the gradient of the loss
with respect to each of the weights.
• The gradient is a vector that gives
the direction in which the loss
function has the steepest ascent.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
        w
• The gradient tells us which way to move on the
loss landscape; it is computed as
∂J(W)/∂W.
• Here the landscape is higher around the
selected point, so we need to take a
step in a direction that is lower than
the selected point.
• We take the gradient of the loss
with respect to each of these weights
to understand the direction of
maximum ascent.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


W* = argmin J(W)
w

• Take a small step in the opposite direction
of the gradient.
• On reaching the lower point, the
process is repeated over and
over again until we converge
to a local minimum.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent
• Repeat until Convergence

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent
Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)          weights = tf.random_normal( )
2. Loop until convergence:
3.   Compute the gradient ∂J(W)/∂W                     grads = tf.gradients(loss, weights)
4.   Update the weights W ← W − η ∂J(W)/∂W             weights_new = weights.assign(weights - lr * grads)
5. Return the weights

To summarize the algorithm known as gradient descent (taking a gradient and descending down the
landscape): we initialize the weights randomly, compute the gradient of J with respect to all of the weights,
and then update the weights in the opposite direction of that gradient, taking a small step scaled by η.
η (eta) is referred to as the learning rate; it is a scalar that indicates how large a step to take at each
iteration, i.e., how strongly or aggressively to step along that gradient.
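The snippets above are TensorFlow 1-style pseudocode. As a hedged, runnable sketch of the same loop in TensorFlow 2, on a hypothetical one-parameter quadratic loss with its minimum at w = 3:

import tensorflow as tf

w = tf.Variable(tf.random.normal([]))      # 1. initialize the weight randomly ~ N(0, sigma^2)
lr = 0.1                                   # learning rate (eta)

for step in range(100):                    # 2. loop (here: a fixed number of steps)
    with tf.GradientTape() as tape:
        loss = (w - 3.0) ** 2              #    hypothetical loss J(w)
    grad = tape.gradient(loss, w)          # 3. compute dJ/dw
    w.assign_sub(lr * grad)                # 4. update: w <- w - eta * dJ/dw

print(w.numpy())                           # 5. return the weights; w converges near 3.0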

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent
Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)          weights = tf.random_normal( )
2. Loop until convergence:
3.   Compute the gradient ∂J(W)/∂W                     grads = tf.gradients(loss, weights)
4.   Update the weights W ← W − η ∂J(W)/∂W             weights_new = weights.assign(weights - lr * grads)
5. Return the weights

• The amount by which the weights are updated during training is referred to as the step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural networks; it has a small positive
value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model adapts to the problem.
• The difficult part is computing that gradient: given a loss and all of the weights in our network, how do we know
which direction is a good way to move? That is done by a process called backpropagation, which we will discuss
using elementary calculus.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent

Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Compute the gradient ∂J(W)/∂W        ← can be very computationally intensive to compute!
4.   Update the weights W ← W − η ∂J(W)/∂W
5. Return the weights

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stochastic Gradient Descent

Algorithm for stochastic gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick a single data point i
4.   Compute the gradient ∂Jᵢ(W)/∂W        ← easy to compute, but very noisy (stochastic)!
5.   Update the weights W ← W − η ∂Jᵢ(W)/∂W
6. Return the weights

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stochastic Gradient Descent with momentum
• SGD is noisy and requires more iterations to
reach the minimum. Adding a momentum term to
regular SGD gives faster convergence of the loss
function.
• SGD oscillates between either direction of the
gradient and updates the weights accordingly.
Adding a fraction of the previous update
to the current update makes the process
faster.
• Updated weight: Wₜ₊₁ = Wₜ − Vₜ, with the velocity
Vₜ = β Vₜ₋₁ + η ∂J(W)/∂W
denoting the accumulated change in the gradient used to reach the global minimum.
• The learning rate should be decreased when a
momentum term is used.
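A minimal NumPy sketch of this momentum update on a hypothetical quadratic loss (the loss, β and η values are illustrative assumptions):

import numpy as np

def grad(w):                        # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9                # learning rate and momentum coefficient

for t in range(100):
    v = beta * v + eta * grad(w)    # V_t = beta * V_{t-1} + eta * dJ/dW
    w = w - v                       # W_{t+1} = W_t - V_t

print(w)                            # approaches the minimum at w = 3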

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mini-batch Gradient Descent

Algorithm for mini-batch gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick a batch of B data points
4.   Compute the gradient ∂J(W)/∂W = (1/B) Σₖ₌₁ᴮ ∂Jₖ(W)/∂W        ← fast to compute and a much better estimate of the true gradient!
5.   Update the weights W ← W − η ∂J(W)/∂W
6. Return the weights
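A hedged NumPy sketch of this loop for a simple linear model y ≈ w·x (the synthetic data, batch size and learning rate are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.1, size=1000)     # hypothetical data with true w = 2

w, eta, B = 0.0, 0.1, 32
for step in range(200):
    idx = rng.integers(0, len(x), size=B)          # 3. pick a batch of B data points
    xb, yb = x[idx], y[idx]
    grad = np.mean(2.0 * (w * xb - yb) * xb)       # 4. (1/B) * sum of per-example gradients of (w*x - y)^2
    w -= eta * grad                                # 5. step in the opposite direction of the gradient

print(w)                                           # close to the true value 2.0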

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mini-batches while training
• Mini-batch gradient descent is a variation of the gradient
descent algorithm that splits the training dataset into small
batches that are used to calculate the model error and update
the model coefficients.
• Mini-batch gradient descent seeks to find a balance between
the robustness of stochastic gradient descent and the efficiency
of batch gradient descent.
• More accurate estimation of the gradient
• Smoother convergence
• Allows for larger learning rates

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Mini-batches while training
Summary points
• More accurate estimation of the gradient: this lets us converge towards the target much more quickly,
and the gradients are more accurate in practice.
• Smoother convergence: if the gradient estimate is quite noisy, we cannot fully trust each step direction;
with a larger batch and more data to estimate the gradient, we can trust the learning rate more and step
more aggressively in that direction.
• Allows for larger learning rates.
• Mini-batches lead to fast training: mini-batch gradient descent splits the training dataset into small
batches used to calculate the model error and update the model coefficients, balancing the robustness of
stochastic gradient descent against the efficiency of batch gradient descent.
• Increased computation speed on GPUs: the computation can be massively parallelized by splitting
batches across multiple GPUs, or even multiple machines, to achieve more significant speed-ups in the
training process.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Summary
• Batch Gradient Descent (BGD):
Uses the entire dataset at every step, making it slow for large datasets.
However, it is computationally efficient, since it produces a stable error gradient and a
stable convergence.
• Stochastic Gradient Descent (SGD):
The other extreme: it uses a single example (a batch of 1) at each
learning step. Much faster, but it may return noisy gradients, which can cause the error rate to
jump around.
• Mini-Batch Gradient Descent:
Computes the gradients on small random sets of instances called mini-batches.
It reduces the noise of SGD while remaining more efficient than BGD.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Backpropagation Algorithm
• The algorithm is used to efficiently
train a neural network by means of the
chain rule.
• After each forward pass through the
network, backpropagation
performs a backward pass while
adjusting the model's parameters
(weights and biases).

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Backpropagation aims to
minimize the cost function by
adjusting the network's weights and
biases.
• The level of adjustment is
determined by the gradients of the
cost function with respect to those
parameters.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• The gradient of a function C(x₁, x₂, …, xₘ) at a point x is a
vector of the partial derivatives of C at x:
∇C(x) = (∂C/∂x₁, ∂C/∂x₂, …, ∂C/∂xₘ)
• The derivative of C measures the sensitivity of the function value
(output value) to a change in its argument x (input value); in other
words, the derivative tells us the direction in which C is going.
• The gradient shows how much the parameter x needs
to change (in the positive or negative direction) to
minimize C.
• Computing those gradients is done using a technique
called the chain rule.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

How does a small change in one weight (e.g., w₂) affect the final loss J(W)?

• This is a simple network with one input layer, one hidden layer (one hidden neuron) and one output layer:
the simplest neural network you can create.
• We want to compute the gradient of the loss with respect to w₂ (the weight between the hidden state and
the output), since a change in w₂ can change the loss value considerably.
• This derivative tells us how much a small change in this weight will affect our loss: if we make
a small change in the weight in one direction, will it increase or decrease our loss, and by how much?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Gradient of the loss with respect to w₂:  ∂J(W)/∂w₂

Let's use the chain rule!

To compute this derivative, we apply the chain rule backwards from the loss function through the output.
That is the gradient we care about: the gradient of our loss with respect to w₂.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Split this into the gradient of the loss with respect to the output:

∂J(W)/∂w₂ = ∂J(W)/∂ŝ · ∂ŝ/∂w₂

• We decompose this derivative into two components using the chain rule from elementary calculus.
• We split it into the gradient of the loss with respect to the output ŝ, multiplied by the gradient of the
output ŝ with respect to w₂.
• This is just a standard application of the chain rule to the original derivative on the left-hand side.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Now repeat this process for a different weight, say w₁:

∂J(W)/∂w₁ = ∂J(W)/∂ŝ · ∂ŝ/∂w₁

Replace w₂ with w₁ and apply the chain rule; the same equation still holds. But we
now notice that the gradient of the output ŝ with respect to w₁ is not directly
computable, so we apply the chain rule again to evaluate it.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Apply the chain rule once more and split with respect to z₁:

∂J(W)/∂w₁ = ∂J(W)/∂ŝ · ∂ŝ/∂z₁ · ∂z₁/∂w₁

• In this way backpropagation carries the gradients from the output all the way back to the input, allowing
the error to propagate from the output layer to the input layer and these gradients to be computed in practice.
• Many popular deep learning frameworks perform automatic differentiation, which does all of this
backpropagation for us.
• The last factor of the previous chain rule could not be evaluated directly, so we recursively applied the
chain rule one more time; with this expansion, all of the components can be evaluated.
• We can propagate these gradients through the hidden units of the neural network all the way back to the
weight we are interested in. In this example we first computed the derivative with respect to w₂, then
back-propagated and reused that information for w₁. That is why it is called backpropagation: the process
runs from the output all the way back to the input.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computing Gradients: Backpropagation
x₁ → (w₁) → z₁ → (w₂) → ŝ → J(W)

Repeat this for every weight in the network, using gradients from later layers:

∂J(W)/∂w₁ = ∂J(W)/∂ŝ · ∂ŝ/∂z₁ · ∂z₁/∂w₁

This process is repeated many times over the course of training: the gradients are back-propagated
through the network, from the output all the way to the inputs, to determine for every single weight how a
small change in that weight affects the loss function (whether it increases or decreases it), and that
information is used to improve the loss, which is our ultimate goal. This is the backpropagation algorithm,
the core of training neural networks.
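To make the chain rule concrete, a small NumPy sketch for the single-hidden-unit network x₁ → z₁ → ŝ → J(W) drawn above; the sigmoid activation and squared-error loss are assumptions made for illustration, not something stated on the slide:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, target = 0.5, 1.0            # hypothetical input and target value
w1, w2 = 0.3, -0.2              # hypothetical weights

# Forward pass (no biases, for simplicity)
z1 = sigmoid(w1 * x)            # hidden unit
s_hat = w2 * z1                 # network output
J = 0.5 * (s_hat - target) ** 2 # squared-error loss

# Backward pass: the chain rule from the slides
dJ_ds = s_hat - target                      # dJ/ds_hat
dJ_dw2 = dJ_ds * z1                         # dJ/dw2 = dJ/ds_hat * ds_hat/dw2
dJ_dw1 = dJ_ds * w2 * z1 * (1 - z1) * x     # dJ/dw1 = dJ/ds_hat * ds_hat/dz1 * dz1/dw1

# Sanity check against a finite-difference approximation
eps = 1e-6
J_plus = 0.5 * (w2 * sigmoid((w1 + eps) * x) - target) ** 2
print(dJ_dw1, (J_plus - J) / eps)           # the two numbers should nearly match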

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
• Batch Normalization
Normalization vs. Standardization

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why Normalization

Covariate shift: a change in the distribution of the inputs to the network (or to its internal layers); a key motivation for normalization.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Batch Normalization
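The batch-normalization slides that follow are figures. As a minimal, hedged Keras sketch, a BatchNormalization layer placed between a dense layer and its activation (the layer sizes and input shape are illustrative choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, use_bias=False),
    tf.keras.layers.BatchNormalization(),   # normalizes activations over the batch, then applies learned scale (gamma) and shift (beta)
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()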

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Deep Neural Networks (alternate explanation: Bias-Variance Trade-off)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Bias
Bias: The difference between the prediction of the values by the Machine Learning model and the
correct value.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias

High bias → large error on the training data as well as the testing data:
• The hypothesis is too simple or linear in nature.
• The predicted values form a straight line that does not fit the data in the data set accurately.
High bias in the model → underfitting.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Variance
Variance: the variability of the model prediction for a given data point, which tells us the spread of the
data.

High variance → a very complex fit to the training data:
• The model is not able to fit accurately on data it has not seen before (test data).
• Such models perform very well on training data but have high error rates on test data.
High variance in the model → overfitting.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Bias Variance Tradeoff
Bias and variance typically trade off in relation to model complexity

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias Variance Tradeoff
• If the algorithm is too simple (a hypothesis with a linear equation), we get a high-bias and
low-variance condition.
• If the algorithm fits too complex a model (a hypothesis with a high-degree equation), we get
high variance and low bias.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias Variance Tradeoff
An algorithm cannot be both more complex and less complex at the same time.

To optimize the total error of the model, we use the bias-variance tradeoff:
• The best fit is given by the hypothesis at the tradeoff point.
• This is the best point to choose for training the algorithm, giving low error on the
training data as well as the testing data.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Bias Variance Tradeoff
1. High bias and high variance (the worst-case scenario)
2. Low bias and low variance (the best-case scenario)
3. Low bias and high variance (overfitting)
4. High bias and low variance (underfitting)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Exponential Moving Average
Task: approximating a given parameter that changes in time, where
we know all of its previous values. The objective is to predict
the next value, which depends on the previous values.

One possible strategy: take the average of the last several values.
This might work in certain cases, but it is not very suitable for scenarios
where the parameter depends more strongly on the most recent values.

A second possible strategy: give higher weights to more recent
values and lower weights to older values.

Exponential Moving Average

It is based on the assumption that more recent values of a variable contribute more to the formation
of the next value than older values.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average

• vₜ is a time series that approximates the given variable; its index t corresponds to the timestamp t.
• The value v₀ for the initial timestamp t = 0 is usually taken as 0.
• θ is the observation at the current iteration.
• β is a hyperparameter between 0 and 1 which defines how the weight should be distributed between
the previous average value vₜ₋₁ and the current observation θ.
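Written out (a reconstruction, since the slide shows the formula only as a figure), the recurrence these bullets describe is, in LaTeX notation:

v_t = \beta \, v_{t-1} + (1 - \beta)\, \theta_t , \qquad v_0 = 0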

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average

Exponential moving average for the t-th timestamp (expanded form):
• The most recent observation θ has a weight of 1, the second-to-last observation β, the third-to-last β²,
and so on.
• Since 0 < β < 1, the multiplication term βᵏ goes down exponentially as k increases, so the older the
observations, the less important they are.
• Finally, every term of the sum is multiplied by (1 − β).
• In practice, the value of β is usually chosen close to 0.9.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average
Mathematical interpretation
• Using the well-known limit (1 − x)^(1/x) → 1/e as x → 0 (obtained with the substitution β = 1 − x), for a
chosen value of β we can compute the approximate number of timestamps t it takes for the weight term βᵗ
to decay to 1/e ≈ 0.368.
• Taking β = 0.9 indicates that after approximately t = 10 iterations the weight has decayed to 1/e,
compared to the weight of the current observation.
• In other words, the exponentially weighted average mostly depends only on the last t ≈ 1/(1 − β) = 10
observations.
• In the equation for the exponential moving average, every observation value is multiplied by a term βᵏ;
comparing the two forms gives this estimate.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average
Bias correction
• A common problem with the exponentially weighted average is that in most problems it cannot
approximate the first few values of the series well.
• This occurs because there is not a sufficient amount of data in the first iterations.
• Case 1: v₀ = 0. The first several values put a large weight on v₀, which is 0, whereas most of the points
on the scatterplot lie above 20, so the approximation is imprecise.
• Case 2: v₀ = the value of the first observation θ₁. Although this approach works well in some situations,
it is still not perfect, especially when the given sequence is volatile (for example, if θ₂ differs too much
from θ₁); it also results in a poor approximation for volatile data.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exponential Moving Average
Bias correction
• The solution is to use a technique called "bias correction".
• Instead of simply using the computed values vₖ, they are divided by (1 − βᵏ). Assuming that β is chosen
close to 0.9-1, this expression tends to be close to 0 for the first iterations, where k is small.
• Thus, instead of slowly accumulating the first several values when v₀ = 0, they are now divided by a
relatively small number, scaling them up to larger values.
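A small NumPy sketch of the correction described above (the data series and β are hypothetical):

import numpy as np

theta = np.array([20.0, 22.0, 21.0, 23.0, 24.0])   # hypothetical observations
beta = 0.9

v = 0.0
for k, obs in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * obs                # plain exponential moving average
    v_corrected = v / (1 - beta ** k)              # bias correction: divide by (1 - beta^k)
    print(k, round(v, 3), round(v_corrected, 3))   # the corrected values track the data from the first step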
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Exponential Moving Average
Bias correction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent : Representation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient descent
Gradient descent is the simplest optimization
algorithm; it computes the gradients of the loss function
with respect to the model weights and updates them with the gradient descent equation

wₜ = wₜ₋₁ − α · dw

where w is the weight vector,
dw is the gradient of w,
α is the learning rate, and
t is the iteration number.

(Figure: an optimization problem with gradient descent in a ravine area.
Blue: starting point. Black: local minimum area, where the surface is much more
steep in one dimension than in another. Courtesy: towardsdatascience)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Gradient descent

• In this example, the starting point and the local minimum have different horizontal coordinates and almost equal vertical
coordinates.
• Using gradient descent to find the local minimum will likely make the loss function slowly oscillate along the vertical axis.
• These bounces occur because gradient descent does not store any history about its previous gradients, making the
gradient steps less deterministic on each iteration.
• Thus, a large learning rate can lead to divergence.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need better optimization algorithms?
• In practice, the gradient descent
technique can run into certain problems
during training that can slow down the
learning process or, in the worst case,
even prevent the optimal weights from
being found.
• These problems are, on the one hand,
so-called saddle points and, on the
other hand, local minima of the loss
function. At saddle points and
local minima the loss function becomes
flat and the gradient at these points goes
towards zero. (Figure: a local minimum and a saddle point.)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Gradient Descent

• A gradient close to zero at a saddle
point or at a local minimum does
not improve the weight parameters
and stalls the whole learning
process.
• Gradient descent can also result in a zig-zag motion towards
the optimal weights, which can slow
down learning a lot.

Gradient Descent
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Momentum
It would be desirable for the optimization to take larger
steps in the horizontal direction and smaller steps in the
vertical direction.
Momentum uses a pair of equations at each iteration:
• an exponentially moving average of the gradient values dw:   vₜ = β vₜ₋₁ + (1 − β) dw
• the normal gradient descent update, using the computed
moving-average value at the current iteration:   wₜ = wₜ₋₁ − α vₜ
The momentum term increases for dimensions
whose gradients point in the same directions
and reduces updates for dimensions whose
gradients change directions. As a result, we
gain faster convergence and reduced oscillation.
(An overview of gradient descent optimization algorithms, Sebastian Ruder)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Momentum
Instead of simply using the gradients to update the weights, we take several
past values and perform the update in the averaged direction.

Momentum usually converges
much faster than gradient
descent. With Momentum,
there is also less risk in
using larger learning rates,
thus accelerating the training
process.

(Figure: optimization with Momentum.)

In Momentum, it is recommended to choose β close to 0.9.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Momentum
The momentum technique is an
approach that provides an
update rule motivated by
the physical perspective of
optimization. Imagine a ball in
hilly terrain trying to reach the
deepest valley. When the slope of
the hill is very steep, the ball gains a
lot of momentum and is able to
pass through slight hills in its way.
As the slope decreases, the
momentum and speed of the ball
decrease, and it eventually comes to
rest in the deepest position of a valley.

(Figure: Momentum (magenta) vs. gradient descent (cyan) on a surface with a
global minimum (the left well) and a local minimum (the right well).)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Momentum
• In general, the velocity can be seen to increase
with time. By using the momentum term,
saddle points and local minima become less
dangerous for the gradient. This is because the
step size toward the global minimum now
depends not only on the slope of the loss
function at the current point, but also on the
velocity that has built up over time.
• The advantage of momentum is that it
makes a very small change to SGD but
provides a big boost to the speed of learning.
We need to store the velocity for all the
parameters and use this velocity for
making the updates.

(Figure: SGD (black) vs. SGD with momentum (blue).)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Nesterov Accelerated Gradient
Momentum can be a good method, but if the momentum is too high the
algorithm may miss the local minimum and continue to move uphill. To resolve
this issue, the NAG algorithm was developed. It is a look-ahead method.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Nesterov Accelerated Gradient
Nesterov Accelerated Gradient is a momentum-based SGD optimizer that
"looks ahead" to where the parameters will be in order to calculate the gradient ex post
rather than ex ante.

(Figure: the projected gradient; V is initialised to 0.)

As in SGD with momentum, β is usually set to 0.9.

The projected gradient value is obtained by going 'one step ahead' using the previous velocity. This
means that at time step t we need to carry out another forward propagation before executing the
backpropagation.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Nesterov Accelerated Gradient
Steps:
1. Update the current weight wₜ to a projected weight w* using the previous velocity.
2. Carry out forward propagation, but using this projected weight.
3. Obtain the projected gradient ∂L/∂w*.
4. Compute Vₜ and wₜ₊₁ accordingly.
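A hedged NumPy sketch of these four steps for a single weight (the quadratic loss and the hyperparameter values are illustrative assumptions):

import numpy as np

def grad(w):                          # gradient of a hypothetical loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, beta = 0.1, 0.9

for t in range(100):
    w_proj = w - beta * v             # 1. projected ("look-ahead") weight using the previous velocity
    g_proj = grad(w_proj)             # 2-3. forward/backward pass at the projected weight -> projected gradient
    v = beta * v + eta * g_proj       # 4. update the velocity ...
    w = w - v                         #    ... and then the weight

print(w)                              # approaches the minimum at w = 3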

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Nesterov Accelerated Gradient

The intuition is that the standard momentum method first computes the
gradient at the current location and then takes a big jump in the direction of the
updated accumulated gradient. In contrast Nesterov momentum first makes a
big jump in the direction of the previous accumulated gradient and then
measures the gradient where it ends up and makes a correction. The idea being
that it is better to correct a mistake after you have made it.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)
(to adapt the learning rate to the computed gradient values)
• There may occur situations
where, during training, one component of the weight
vector has very large gradient
values while another one has
extremely small ones.
• This happens especially
when an infrequent
model parameter appears to
have a low influence on
predictions.
• The same problem can occur
with sparse data, where there
is too little information about
certain features.

AdaGrad accumulates the element-wise squares dw² of the gradients from all previous iterations:
vₜ = vₜ₋₁ + dw²
During the weight update, instead of using the normal learning rate α, AdaGrad scales it by dividing α by
the square root of the accumulated gradients √vₜ:
w ← w − (α / (√vₜ + ε)) · dw
where a small positive term ε is added to the denominator to prevent potential division by zero.
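A minimal NumPy sketch of this update rule (the quadratic loss and the value of α are assumptions for illustration):

import numpy as np

def grad(w):                                   # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
alpha, eps = 0.5, 1e-8

for t in range(500):
    g = grad(w)
    v += g ** 2                                # accumulate element-wise squared gradients
    w -= alpha / (np.sqrt(v) + eps) * g        # scale the learning rate by 1 / sqrt(accumulated squares)

print(w)                                       # approaches 3; note how the effective step size decays as v grows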

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)
Advantage:
The greatest advantage of AdaGrad is that
there is no longer a need to manually adjust
the learning rate as it adapts itself during
training.

• AdaGrad deals with the aforementioned


problem by independently adapting the learning
rate for each weight component.
• If gradients corresponding to a certain weight
vector component are large, then the respective
learning rate will be small.
• Inversely, for smaller gradients, the learning rate
will be bigger. This way, Adagrad deals with
vanishing and exploding gradient problems.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)
Disadvantage:
• The learning rate constantly
decays with the increase of
iterations (the learning rate is
always divided by a positive
cumulative number).
Therefore, the algorithm
tends to converge slowly
during the last iterations
where it becomes very low.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AdaGrad (Adaptive Gradient Algorithm)

AdaGrad (white) vs. gradient descent (cyan) on a terrain with a saddle point. The learning rate of AdaGrad is set to be
higher than that of gradient descent, but the point that AdaGrad's path is straighter stays largely true regardless of
learning rate. This property allows AdaGrad (and other similar
Amity Centre for Artificial Intelligence, Amity University, Noida, India
AdaGrad (Adaptive Gradient Algorithm)
From the animation, it can
be seen that Adagrad
might converge slower
compared to other
methods. This could be
because the accumulated
gradient in the
denominator causes the
learning rate to shrink and
become very small,
thereby slowing down the
learning over time.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Issue with a squared gradient for vₜ :
• Transformation equations when using only the last squared gradient at every iteration, i.e., vₜ = dw²:
• If dw > 0, then the weight w is decreased by α.
• If dw < 0, then the weight w is increased by α.
• Thus, if vₜ = dw², the model weights can only be changed by ±α.
• Though this approach works sometimes, it is still not flexible: the algorithm becomes
extremely sensitive to the choice of α, and the absolute magnitudes of the gradients are ignored,
which can make the method tremendously slow to converge.
• One small positive aspect of this algorithm is that only a single bit is required to
store the signs of the gradients, which can be handy in distributed computations with strict
memory requirements.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RMSProp (Root Mean Square Propagation)
RMSProp was developed as an improvement over AdaGrad that tackles the
issue of learning rate decay, using an exponentially moving average.
• Instead of storing a cumulative sum of squared
gradients dw² in vₜ, the exponentially moving average is
calculated for the squared gradients dw².
• Experiments show that RMSProp generally converges faster
than AdaGrad because, with the exponentially moving
average, it puts more emphasis on recent gradient values
rather than equally distributing importance between all
gradients by simply accumulating them from the first iteration.
• Furthermore, compared to AdaGrad, the learning rate in
RMSProp does not always decay with the increase of iterations,
making it possible to adapt better in particular situations.
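A minimal NumPy sketch of the RMSProp update (the quadratic loss and hyperparameter values are assumptions for illustration):

import numpy as np

def grad(w):                                      # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
alpha, beta, eps = 0.1, 0.9, 1e-8

for t in range(200):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2            # exponentially moving average of squared gradients
    w -= alpha / (np.sqrt(v) + eps) * g           # per-parameter scaled step

print(w)                                          # settles close to w = 3 (small oscillations around the minimum are expected)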

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RMSProp (Root Mean Square Propagation)

In RMSProp, it is recommended to choose β close to 1.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RMSProp (Root Mean Square Propagation)

RMSProp (green) vs AdaGrad (white). The first run just shows the balls; the second run also shows the
sum of gradient squared represented by the squares.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Adam (Adaptive Moment Estimation)
• Adam is the most popular optimization algorithm in deep learning.
• Adam combines the Momentum and RMSProp algorithms: it keeps
track of exponentially moving averages of the computed gradients (the first moment) and of the squared
gradients (the second moment).
• Furthermore, it is possible to use bias correction on the moving averages for a more
precise approximation of the gradient trend during the first several iterations.
• Experiments show that Adam adapts well to almost any type of neural network
architecture, taking the advantages of both Momentum and RMSProp.

(Figure: the first moment, the second moment, and the updated weight.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Adam (Adaptive Moment Estimation)

According to the Adam paper (https://arxiv.org/pdf/1412.6980.pdf), good default values for


hyperparameters are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Role of first moment and second moment play in adaptively
adjusting the learning rate
First Moment:
• The exponentially decaying average of past gradients for each parameter (a running mean of the
gradients).
•Imagine it as a "moving average" of how steeply the loss function changes in the
direction of each parameter.
•This helps to track the overall trend of the gradient, preventing Adam from being
overly affected by sudden spikes or fluctuations.
•Its contribution is to provide a smoother and more stable direction for updating
the weights compared to using just the current gradient.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Role of first moment and second moment play in adaptively
adjusting the learning rate
Second Moment:
•Also known as the RMSprop squared gradient , it represents the exponentially decaying
average of squared past gradients for each parameter.
•Think of it as a measure of how "jumpy" or volatile the recent changes in the
gradient have been for each parameter.
•If the second moment is high, it indicates significant fluctuations, and Adam reduces the
learning rate for that parameter, preventing it from overshooting the minimum loss.
•Conversely, a low second moment suggests consistent improvement, and Adam allows a
faster learning rate for that parameter.
•The contribution of the second moment is to dynamically adjust the learning rate for
each parameter, preventing overshooting and allowing faster convergence in areas with
smoother changes.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Steps Involved in the Adam Optimization Algorithm
1. Initialize the first and second moments’ moving averages (v and s) to zero.
2. Compute the gradient of the loss function with respect to the model parameters.
3. Update the moving averages using exponentially decaying averages. This involves
calculating vt and st as weighted averages of the previous moments and the
current gradient.
4. Apply bias correction to the moving averages, particularly during the early
iterations.
5. Calculate the parameter update by dividing the bias-corrected first moment by the
square root of the bias-corrected second moment, with an added small constant
(epsilon) for numerical stability.
6. Update the model parameters using the calculated updates.
7. Repeat steps 2-6 for a specified number of iterations or until convergence.
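A compact NumPy sketch of these steps for a single parameter (the quadratic loss is an assumption for illustration; the hyperparameter defaults follow the Adam paper):

import numpy as np

def grad(w):                                     # gradient of a hypothetical loss J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0
v = s = 0.0                                      # 1. first and second moment moving averages start at zero
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)                                  # 2. gradient of the loss w.r.t. the parameter
    v = beta1 * v + (1 - beta1) * g              # 3. moving average of gradients (first moment)
    s = beta2 * s + (1 - beta2) * g ** 2         #    moving average of squared gradients (second moment)
    v_hat = v / (1 - beta1 ** t)                 # 4. bias correction
    s_hat = s / (1 - beta2 ** t)
    w -= alpha * v_hat / (np.sqrt(s_hat) + eps)  # 5-6. parameter update

print(w)                                         # 7. after enough iterations, w is close to 3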

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Advantage
Adam tends to focus on faster computation time, whereas algorithms like stochastic
gradient descent focus on the data points. That is why algorithms like SGD
generalize the data in a better manner, at the cost of lower computation speed.

Disadvantage
It focuses on computation time rather than on the data points.

Note: the optimization algorithm can therefore be picked according to the
requirements and the type of data.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Visualizations of various optimization algorithms.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-
optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Summary- Optimizers

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 1
Loss Function

Amity Centre for Artificial Intelligence, Amity University, Noida, India


“Visualizing the loss
landscape of neural
nets”. Dec 2017.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

W ← W − η ∂J(W)/∂W

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

W ← W − η ∂J(W)/∂W

• η (eta) is the learning rate for training the network.
• It has a high impact on the performance of the model.
• How can we set the learning rate?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Setting the Learning Rate
• Setting a smaller learning rate means not trusting the gradient.
• A small learning rate converges slowly and gets stuck in false local minima.

(Figure: loss J(W) versus weight W, starting from an initial guess.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate
• Large learning rates overshoot, become unstable and diverge, which is even more undesirable.

(Figure: loss J(W) versus weight W, starting from an initial guess.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate
• Setting the learning rate is very challenging.
• Stable learning rates converge smoothly and avoid local minima.

(Figure: loss J(W) versus weight W, starting from an initial guess.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Setting the Learning Rate

Amity Centre for Artificial Intelligence, Amity University, Noida, India


How to deal with setting learning rate?

Idea 1:
Trial and error: try different learning rates and see which works best.

Idea 2:
Do something smarter!
Design an adaptive learning rate that "adapts" to the landscape.

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Adaptive Learning Rates

• Learning rates are no longer fixed


• Can be made larger or smaller depending on:
• how large gradient is
• how fast learning is happening
• size of particular weights
• etc...

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Summary
• Loss function: compares the target and predicted output values to measure
how well the neural network models the training data.
• Types of loss function:
  • Regression loss
  • Classification loss
• Learning rate: a hyperparameter used to govern the pace at which an
algorithm updates or learns the values of a parameter estimate.
• An adaptive learning rate is a better solution than a fixed learning rate.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adaptive Learning Rates
Algorithm (TensorFlow implementation):
• Adam (tf.keras.optimizers.Adam)
• Adadelta (tf.keras.optimizers.Adadelta)
• Adagrad (tf.keras.optimizers.Adagrad)
• RMSProp (tf.keras.optimizers.RMSprop)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adagrad (Adaptive Gradient Descent)
• Here the change in the learning rate depends on how much the parameters change
during training: the more the parameters change, the smaller the
learning rate becomes. The formula to update the weights is

wₜ₊₁ = wₜ − (η / √(αₜ + ε)) · ∂J(ω)/∂ω

where η is a constant (the base learning rate), ε is a small positive value to avoid division by zero, and
the accumulated squared gradients αₜ give a different effective learning rate at each iteration.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adagrad (Adaptive Gradient Descent)

• Advantage: It abolishes the need to modify the learning rate manually. it


reaches convergence at a higher speed.

• Disadvantage: It decreases the learning rate aggressively and monotonically.


There might be a point when the learning rate becomes extremely small,
because the squared gradients in denominator keep accumulating, and thus
the denominator increasing. Due to small learning rates, the model
eventually becomes unable to acquire more knowledge, thus, accuracy of
the model is compromised.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RMSprop (Root mean square propogation)
• It uses the sign of the gradient, adapting the step size individually
for each weight.
• Two successive gradients are first compared for sign. If they have the same sign, we are going in the
right direction, so the step size is increased by a small fraction; for opposite signs, the step size is
decreased.
• The algorithm keeps a moving average of the squared gradients for every
weight and divides the gradient by the square root of this mean square:

Wₜ₊₁ = Wₜ − (η / √v_{w,t}) · ∂J(ω)/∂ω
v_{w,t+1} = γ · v_{w,t} + (1 − γ) · (∂J(ω)/∂ω)²

where γ is the momentum (forgetting) factor, usually 0.9.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RMSprop (Root mean square propogation)

• Advantage: it reduces the monotonic decrease in the learning rate seen in AdaGrad.

• Disadvantage: it does not work well on the whole of a large dataset at once, but rather
with mini-batches of data.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Adadelta

• AdaDelta is a stochastic optimization technique that allows for


per-dimension learning rate method for SGD.

• It is an extension of Adagrad that seeks to reduce its aggressive,


monotonically decreasing learning rate.

• Instead of accumulating all past squared gradients, Adadelta


restricts the window of accumulated past gradients to a fixed size
w.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Adadelta

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Adam (Adaptive moment estimation)


• The Adam optimizer updates the learning rate for each network weight individually.
• The first moment is the mean, and the second moment is the uncentered variance (meaning we
do not subtract the mean during the variance calculation):

mₜ = β₁ mₜ₋₁ + (1 − β₁) ∂J(ω)/∂ω
vₜ = β₂ vₜ₋₁ + (1 − β₂) (∂J(ω)/∂ω)²

Bias-corrected estimators are then computed for the first and second moments.
• Since mₜ and vₜ are initialized to 0, they tend to be 'biased towards 0', as both β₁ and β₂ ≈ 1. Adam
fixes this problem by computing bias-corrected mₜ and vₜ. This also controls the weights when
approaching the global minimum, preventing high oscillations near it.
• The algorithm has a faster running time, low memory requirements, and requires less tuning
than other optimization algorithms.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Visualizations of various optimization algorithms.

Source: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-
optimizers/#Adagrad_(Adaptive_Gradient_Descent)_Deep_Learning_Optimizer
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 3
Overfitting and Underfitting: Bias-Variance Trade-off

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• The model is too simplistic and not
able to learn enough from the training
data

• Hence it reduces the accuracy and


produces unreliable predictions.

• How to avoid Underfitting?


• By increasing the training time of The model is unable to capture the data points
present in the plot.
the model.
• By increasing the number of Source:- https://www.javatpoint.com/overfitting-
features. and-underfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• The model is too simplistic and not able
to learn enough from the training data

• Hence it reduces the accuracy and


produces unreliable predictions.

• Reason for Underfitting?


• Data used for training is not cleaned
and contains noise (garbage values)
in it The model is unable to capture the data points
• The model has a high bias present in the plot.
• The size of the training dataset used
is not enough Source:- https://www.javatpoint.com/overfitting-
• The model is too simple and-underfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• When learning a model we have a set of data (training set)
that we use to learn the model parameters
• The evaluation of the model needs to happen out-of-sample,
i.e., on a different set that was not used for learning model
parameters
• One of the most common problems during training is tying
the model to the training set
– Overfitting

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• When a model is over fitted it is not expected to perform well
to new data
– It is not generalizable

• Overfitting occurs when the model chosen is too complex that


ends up describing the noise in the data instead of the trend
– E.g., too many parameters relative to the size of the training dataset
– An over fitted model memorizes the training instances and does not
learn the general trend in them

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
•Data used for training is not cleaned and contains noise
(garbage values) in it

•The model has a high variance

•The size of the training dataset used is not enough

•The model is too complex

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Bias of a Model: Underlying assumptions to make learning possible.
Simpler model=>More assumption=> High Bias

• Variance of a Model: Variability of model for given data points, Model


with high variance pays a lot of attention to training data, may end up
memorizing data rather than learning from it

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
• If we want to minimize MSE, we need to minimize both bias and variance
• However, when bias gets smaller, variance increases and vice versa
• A model that is underfitted has high bias
– Misses relevant relations between the independent variables and the
response variable
– Bias is reduced by increasing model complexity
• A model that is overfitted has high variance
• The model captures the noise in the training data instead of the trend
• Variance is reduced by decreasing model complexity

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Trading off goodness of fit against complexity of the model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• The real aim of supervised learning is to do well on test data that is not known
during learning
• Choosing the values for the parameters that minimize the loss function on the
training data is not necessarily the best policy
• Generalization refers to how well the model trained on the training data predicts the correct output for new instances
• We want the learning machine to model the true regularities in the data
and to ignore the noise in the data.
• But the learning machine does not know which regularities are real and
which are accidental quirks of the particular set of training examples we
happen to pick
• So how can we be sure that the machine will generalize correctly to new
data?
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Model Selection: Which model is best?

Source: https://www.javatpoint.com/overfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
•Techniques to Avoid Overfitting
•Data Augmentation
•Regularization
•Drop-out
•Early-stopping
•Cross validation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Simple model (low complexity, low capacity):
• Has fewer parameters to be learned
• May underfit: it may not capture the underlying trend of the data
• Higher error for training data, and may also give high error for validation data
• High bias, low variance

Complex model (high complexity, high capacity):
• Has more parameters to be learned
• May overfit: it may start learning from noise and inaccurate data entries
• Lower error for training data, but may give higher error for validation data
• Low bias, high variance

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 4
How to avoid overfitting

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problem of overfitting

Source: https://www.javatpoint.com/overfitting-in-machine-learning
Amity Centre for Artificial Intelligence, Amity University, Noida, India
•Techniques to Avoid Overfitting
•Data Augmentation
•Regularization
•Drop-out
•Early-stopping
•Cross validation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Train with more data to avoid overfitting; regularize the model
• Capturing and labeling data is usually expensive
• New data is generated from existing data (see the sketch after this slide) with the help of:
• Image rotations
• Translations
• Blur, added noise
• Brightness changes
• Scaling
• Flips (up-down, left-right)
• and so on
Amity Centre for Artificial Intelligence, Amity University, Noida, India
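A minimal sketch of this kind of augmentation with tf.keras preprocessing layers; the transformation factors below are illustrative assumptions, not values from the slides:

import tensorflow as tf

# Augmentation pipeline: each layer applies a random transformation at training time
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),  # left-right / up-down flips
    tf.keras.layers.RandomRotation(0.1),                    # small random rotations
    tf.keras.layers.RandomTranslation(0.1, 0.1),            # random shifts in height/width
    tf.keras.layers.RandomZoom(0.1),                        # random scaling
    tf.keras.layers.RandomContrast(0.2),                    # brightness/contrast changes
    tf.keras.layers.GaussianNoise(0.01),                    # add a small amount of noise
])

# Typically placed right after the model input, so new "versions" of each
# training image are generated on the fly every epoch.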
A very deep network with many neurons can fit the training set almost perfectly (overfitting).
Slide source: Coding Lane

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Regularization constrains such a very deep network with many neurons, reducing its effective complexity.
Slide source: Coding Lane

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Amity Centre for Artificial Intelligence, Amity University, Noida, India


Types of Regularization
• Ridge (L2) Regularization
• Lasso (L1) Regularization
• Elastic Net Regularization

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Cost function = sum of the squared residuals + λ × (slope)²
For the linear regression line, let's consider two points that are on the line.
Here:
• sum of the squared residuals = the data-fit term
• λ × (slope)² = the penalty for the errors
• slope = slope of the curve/line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Cost function = sum of the squared residuals + λ × (slope)²
For the linear regression line, consider two points that are on the line:
• sum of the squared residuals = 0 (the two points lie on the line)
• λ = 1
• slope = 1.4
Then, cost function = 0 + 1 × (1.4)² = 1.96
Linear regression line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Cost function = sum of the squared residuals + λ × (slope)²
For the ridge regression line, let's assume:
• λ = 1
• slope = 0.7
Then, cost function = sum of the squared residuals + 1 × (0.7)² = 0.63
Ridge regression line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge (L2) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients.

Comparing the two models on all data points, the ridge regression line (cost ≈ 0.63) fits the data more accurately than the linear regression line (cost ≈ 1.96).
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Lasso (L1) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the absolute values of the coefficients.

Cost function = sum of the squared residuals + λ × |slope|
Here:
• sum of the squared residuals = the data-fit term
• λ × |slope| = the penalty for the errors
• slope = slope of the curve/line
Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Lasso (L1) Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the absolute values of the coefficients.

Comparing the two models on all data points, the lasso regression line (cost ≈ 0.8) fits the data more accurately than the linear regression line (cost ≈ 1.4).

Slide source: simplilearn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Elastic Net Regularization
It modifies the overfitted or underfitted model by adding a penalty equivalent to the sum of the squares of the magnitude of the coefficients plus the sum of the absolute values of the coefficients.

It is the combination of Ridge and Lasso regularization.

Cost function = sum of the squared residuals + λ₁ × |slope| + λ₂ × (slope)²
Here:
• sum of the squared residuals = the data-fit term
• the λ terms = the penalty for the errors
• slope = slope of the curve/line

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Ridge: Useful when we have many variables with relatively smaller data samples. Ridge will reduce the impact of features that are not important in predicting the output values.
Lasso: Preferred when we are fitting a linear model with fewer variables. Lasso will eliminate many features and reduce overfitting in the linear model.
Elastic Net: Preferred when we do not know whether we want shrinkage or sparsity in the parameter space. Elastic Net combines feature elimination from Lasso and feature-coefficient reduction from Ridge to improve the model's predictions.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
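A minimal sketch of how these penalties are typically attached to network weights in tf.keras; the λ values of 0.01 are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    # L2 (ridge) penalty: adds lambda * sum(w**2) to the loss for this layer's weights
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01)),
    # L1 (lasso) penalty: adds lambda * sum(|w|) to the loss, can push weights to exactly 0
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l1(0.01)),
    # Elastic net: combination of both penalties
    layers.Dense(10, activation="softmax",
                 kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)),
])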


• During training, some number of nodes are randomly ignored or "dropped out"
• During weight updation, the layer configuration appears "new"
• Provides regularization by avoiding co-adaptation between network layers to correct mistakes from prior layers
• Improves generalization of the model
• Useful in wider networks to avoid overfitting

Amity Centre for Artificial Intelligence, Amity University, Noida, India
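A minimal tf.keras sketch of dropout between dense layers; the 0.5 drop rate and layer sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),   # randomly zeroes 50% of the activations, only during training
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
# At test time dropout is disabled automatically (Keras uses inverted dropout,
# so no extra rescaling is needed at inference).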


• Stop training before we have a chance to overfit
• Number of iterations (epochs) is a hyperparameter
• Too few epochs => suboptimal solution (underfit)
• Too many epochs => overfitting
(Plot: Loss vs. Training Iterations. The training loss keeps decreasing, while the testing loss starts rising after some point; stop training there. The region before that point corresponds to under-fitting, the region after it to over-fitting.)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
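A minimal sketch of early stopping with a tf.keras callback, assuming a compiled model and a held-out validation set (the variable names and patience value are illustrative):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation (testing) loss curve
    patience=5,                 # allow 5 epochs without improvement before stopping
    restore_best_weights=True   # roll back to the weights at the best point of the curve
)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=200,             # upper bound; training usually stops earlier
                    callbacks=[early_stop])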


• When data is plentiful, set aside a part of the training data as validation data -> perform model selection
• Declare the final result on test data
• A typical ratio for splitting into training, validation, and test data is 60:20:20

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• K-fold cross-validation
• When data is not sufficient, split the data into k segments, train with (k-1) segments, validate with the remaining segment, and iterate

Amity Centre for Artificial Intelligence, Amity University, Noida, India
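A minimal k-fold cross-validation sketch using scikit-learn's KFold to generate the splits; build_model is an assumed helper that returns a freshly compiled model with an accuracy metric, and X, y are the full training arrays:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, val_idx in kf.split(X):            # X, y: the full training data
    model = build_model()                         # assumed helper: new, compiled model
    model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
    loss, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)                            # validate on the held-out segment

print("mean validation accuracy:", np.mean(scores))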


Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 5
Batch Normalization

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Normalization
• Normalization is a procedure to change the values of numeric variables in the dataset to a common scale, without distorting the differences in the ranges of values.

Batch Normalization
• Batch normalization is a technique for training very deep neural networks that normalizes the contributions to a layer for every mini-batch. This has the effect of stabilizing the learning process and drastically decreasing the number of training epochs required to train deep neural networks.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Normalization is a data pre-processing tool used to bring the numerical
data to a common scale without distorting its shape, to ensure that our
model can generalize appropriately.

• Batch normalization is a process to make neural networks faster and


more stable through adding extra layers in a deep neural network. The
new layer performs the standardizing and normalizing operations on
the input of a layer coming from a previous layer.

• The normalizing process in batch normalization takes place in batches,


not as a single input.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-processing stage. When
the input passes through the first layer, it transforms, as a sigmoid function applied over the dot product of input X
and the weight matrix W.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Similarly, this transformation will take place for the second layer and go till the last layer L as shown in the following
image.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Although our input X was normalized, over time the output will no longer be on the same scale. As the data goes through multiple layers of the neural network and L activation functions are applied, it leads to an internal covariate shift in the data.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Internal Covariate Shift is the change in the distribution of network
activations due to the change in network parameters during training

https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
Amity Centre for Artificial Intelligence, Amity University, Noida, India
If we stabilize the input values for each layer (defined as z = Wx + b, where z is the linear transformation with the weights/parameters W and the biases b), we can prevent our activation function from pushing the input values into the saturated (max/min) regions of the activation function.
Fig.: From the gradient it can be observed that for larger z the function approaches zero. When the network's nodes exist in this space, training slows down significantly, since the gradient values decrease.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Transform the data to have mean zero and standard deviation one.
• Calculate the mean and standard deviation of the hidden-layer activations (over the mini-batch, for each of the neurons at layer h).
• Normalize the hidden activations by subtracting the mean from each activation and dividing by the standard deviation, with a small smoothing term (ε) added for numerical stability.
• γ (gamma) and β (beta): these learnable parameters are used for re-scaling (γ) and shifting (β) the vector containing the values from the previous operations.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
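In the standard formulation (for a mini-batch x_1, ..., x_m at a given layer), these steps are usually written as:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

where γ and β are learned along with the network weights.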


https://medium.com/analytics-vidhya/internal-covariate-shift-an-overview-of-how-to-speed-up-neural-network-training-3e2a3dcdd5cc
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Speeds up the training
By normalizing the hidden-layer activations, batch normalization speeds up the training process.

• Handles internal covariate shift
It solves the problem of internal covariate shift. Through this, we ensure that the input to every layer is distributed around the same mean and standard deviation.

• Smoothens the loss function
Batch normalization smoothens the loss function, which makes the model parameters easier to optimize and in turn improves the training speed of the model.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
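A minimal tf.keras sketch of adding batch-normalization layers between the dense layers of a network; the layer sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64),
    layers.BatchNormalization(),   # normalizes this layer's outputs per mini-batch
    layers.Activation("relu"),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])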


Deep learning
• Course Code:
• Unit 2
Model parameter optimization
• Lecture 6
Hyperparameter tuning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameters are defined as the parameters that are explicitly set by the user to control the learning process
• They are used in the process of estimating the model parameters; they are specific to the algorithm and, unlike parameters, cannot be calculated from the data
• They are selected and set before the learning algorithm begins training the model. Hence, they are external to the model, and their values cannot be changed during the training process.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


•The k in kNN or K-Nearest Neighbour algorithm
•Learning rate for training a neural network
•Number of layers
•Number of nodes per layer
•Momentum
•Train-test split ratio
•Batch Size
•Number of Epochs
•Number of clusters in Clustering Algorithm
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Model Parameter
• Used by the model for making predictions.
• Learned by the model from the data itself.
• Usually not set manually.
• Part of the model and key to a machine learning algorithm.

Model Hyperparameter
• Usually defined manually by the machine learning engineer.
• One cannot know the exact best value of a hyperparameter for a given problem; the best value can be determined by trial and error.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameter for Optimization
• Learning Rate
• Batch Size

• Hyperparameter for Specific Models


• Number of hidden units
• Number of layers
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
• Hyperparameter tuning consists of finding a set of optimal
hyperparameter values for a learning algorithm while applying this
optimized algorithm to any data set

• It maximizes the model’s performance, minimizing a predefined loss


function to produce better results with fewer errors.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Some important hyperparameters that require tuning in neural
networks are:
• Number of hidden layers
• Number of nodes/neurons per layer
• Learning rate
• Momentum

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameters can be tuned either manually or automatically.
• Some automated hyperparameter tuning methods include:
• Grid search,
• Random search,
• Bayesian optimization.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Grid search is a sort of "brute force" hyperparameter tuning method. A grid of possible discrete hyperparameter values is defined, and the model is fitted with every possible combination. The model performance for each set is recorded, and the combination that produced the best performance is selected.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• It chooses random values rather than
using a predefined set of values like
the grid search method.

• Tries a random combination of


hyperparameters in each iteration
and records the model performance.
After several iterations, it returns the
mix that produced the best result.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Grid Search. Random Search.

Grid and random search often evaluate many unsuitable


hyperparameter combinations.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
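A minimal manual grid-search sketch over two common hyperparameters, learning rate and batch size; build_model(lr) is an assumed helper that returns a model compiled with that learning rate and an accuracy metric, and the candidate values are illustrative:

best_acc, best_config = 0.0, None

for lr in [1e-2, 1e-3, 1e-4]:                 # grid of discrete learning rates
    for batch_size in [16, 32, 64]:           # grid of discrete batch sizes
        model = build_model(lr)               # assumed helper
        model.fit(x_train, y_train, batch_size=batch_size, epochs=10, verbose=0)
        _, acc = model.evaluate(x_val, y_val, verbose=0)   # record each combination
        if acc > best_acc:
            best_acc, best_config = acc, (lr, batch_size)

print("best combination (lr, batch size):", best_config)

Random search follows the same loop structure, but samples each hyperparameter value at random instead of walking the full grid.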
• This method treats the search for
the optimal hyperparameters as
an optimization problem.
• When choosing the next
hyperparameter combination, this
method considers the previous
evaluation results and then
applies a probabilistic function to
select the combination that will
probably yield the best results

Amity Centre for Artificial Intelligence, Amity University, Noida, India


• Hyperparameters are the parameters that
are explicitly defined to control the
learning process before applying to a
learning algorithm.
• These are used to specify the learning
capacity and complexity of the model.
• Some of the hyperparameters are used for
the optimization of the models, such as
Batch size, learning rate, etc., and some are
specific to the models, such as Number of
Hidden layers, etc.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
Convolutional Neural
Networks.
Images, Text,
Sound etc.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
255 200 211 235 1 255 200 211 235 1

200 161 217 233 0 200 161 217 233 0


218 65 214 237 0 218 65 214 237 0
232 29 217 236 1 232 29 217 236 0
234 23 216 240 0 234 23 216 240 0
102 31 217 234 0 102 31 217 234 0
Computer Interpretation

• For grayscale images, the pixel value is a single number that represents the brightness of the pixel. The most common pixel format is the byte image, where this number is stored as an 8-bit integer giving a range of possible values from 0 to 255.
• Similarly, for color images, each color channel is represented by the range of decimal numbers from 0 to 255 (256 levels for each color), equivalent to the range of binary numbers from 00000000 to 11111111, or hexadecimal 00 to FF. The total number of available colors is 256 x 256 x 256, or 16,777,216 possible colors.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Virat Kohli 86%
Rohit Sharma 7%
M. S. Dhoni 5.8%
Sachin Tendulkar 1.2%

Classification
Input Image Pixel
Representation

Classification
Model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


FACE: Low-level features (edges, corners) -> Mid-level features (eye, nose, ears) -> High-level features (facial structure)

CAR: Low-level features (edges, corners) -> Mid-level features (head light, tyre) -> High-level features (vehicle shape and structure)

High-level Features, Mid-level Features, Low-level Features

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully connected neural networks (FCNNs) are a type of artificial neural network where
the architecture is such that all the nodes, or neurons, in one layer are connected to the
neurons in the next layer.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Input Image
(2D, Matrix of Pixels)
x1 x2 xn

Fully Connected Layer


(Connects neurons of input layer and
hidden layer, has multiple parameters,
no spatial information)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Connect Input Layer
Convolution Filter patches to neurons of
Filter Size = 4 ꓫ 4
Number of Weights = 16
hidden layer/subsequent
Shift or Stride = 2 layer with sliding window
approach.

Step 1:
Extract Set of Local Features by
applying filters (set of weights)
Step 2:
Apply Multiple Filters for
extraction of different features
Step 3:
Spatial Sharing of parameters
for each filter
Input Image (Array of Pixels)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolutional Neural Network (Feature
Extraction and Convolution)

Input Image
(2-D array of
pixels)
Convolutional Neural
Network X or O
Convolutional Neural
Network X
Convolutional Neural
Network O
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Challenging Cases
Rotation Weighted Translation Scaling

Convolutional Neural
Network X

Convolutional Neural
Network O

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computer and Human Interpretation

=
Human
Interpretation

=
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1
Computer -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1
Interpretation -1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Computer Interpretation

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 X -1 -1 -1 -1 X X -1
-1 X X -1 -1 X X -1 -1
-1 -1 X 1 -1 1 -1 -1 -1
Pixel wise
-1 -1 -1 -1 1 -1 -1 -1 -1
Matching -1 -1 -1 1 -1 1 X -1 -1
-1 -1 X X -1 -1 X X -1
-1 X X -1 -1 -1 -1 X -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

=x
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 -1 -1 -1
Decision -1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 1 1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Computers are Literal

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Feature matching for symbol ‘X’

=
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Piece Matching of Features

Features

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Piece Matching of Features

1 -1 1
-1 1 -1
1 -1 1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution Operation

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution Operation

1 -1 -1
-1 1 -1
-1 -1 1
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Convolution Operation
Feature Map
1 -1 -1 1 1 -1
-1 1 -1 1 1 1
-1 -1 1
-1 1 1
Filter
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 .55
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India
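A minimal NumPy sketch of the matching step illustrated above, assuming the feature-map value at each position is the average of the element-wise products between the 3x3 filter and the image patch under it (which reproduces scores such as 0.55 and 1.00 on the +1/-1 'X' image):

import numpy as np

def match_score(patch, kernel):
    # element-wise multiply, then average over the 9 positions
    return float(np.mean(patch * kernel))

def feature_map(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):                      # slide the filter over the image
        for j in range(out_w):
            out[i, j] = match_score(image[i:i+kh, j:j+kw], kernel)
    return out

kernel = np.array([[ 1, -1, -1],
                   [-1,  1, -1],
                   [-1, -1,  1]])               # diagonal filter from the slide

# With image set to the 9x9 matrix of +1/-1 values shown above,
# feature_map(image, kernel) gives the 7x7 map of match scores.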


Convolution Operation

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
=
1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 -1 -1 1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution Operation
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1

=
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1

=
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1

=
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolutional layer

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Kernel vs Filter

A kernel is the matrix that is swept (convolved) across a single channel of the input.
A filter is the collection of all kernels that are convolved over the channels of the input.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Identity Kernel

Original Image Output Image – Same as Original

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Blur

Original Image Output Image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Left Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Right Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Bottom Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Top Sobel

Original Image Output Image

Sobel kernels are used to show only the differences in adjacent pixel values in a particular direction

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Emboss

Original Image Output Image


The emboss kernel (similar to the Sobel kernel and sometimes referred to mean the same) givens the illusion of depth by
emphasizing the differences of pixels in a given direction. In this case, in a direction along a line from the top left to the bottom right.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Outline

Original Image Output Image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Kernels and their effects on an Image

Sharpen

Original Image Output Image

The sharpen kernel emphasizes differences in adjacent pixel values. This makes the image look more vivid.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
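A minimal sketch of applying one of these kernels to a grayscale image using scipy.ndimage.convolve; the sharpen-kernel values below are the commonly used ones and are an assumption here, since the slide shows only the resulting image:

import numpy as np
from scipy import ndimage

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])   # emphasizes differences between adjacent pixels

# image: a 2-D NumPy array of grayscale pixel values
sharpened = ndimage.convolve(image, sharpen, mode="reflect")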


CONV: Convolutional (kernel) layer
RELU: Activation function
POOL: Dimension-reduction layer
FC: Fully connected layer

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolutional Neural Network--- Spatial View

Depth
32 Dimensions of Layer
H*W*D
Height
H (height) and W (width)
are spatial dimensions
whereas D (depth) is
Width number of filters
32
3
Stride = Step size of filter, Receptive Field = Location of connected path in an input image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Non-Linearity in
Convolutional Neural
Network
Applied after every convolutional layer. The
rectified linear activation (ReLU) function is a
simple calculation that returns the value
provided as input directly, or the value 0.0 if
the input is 0.0 or less.

g(x) = max(0, x)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Rectified Linear Units (ReLUs)
(In each row below, the left seven values are the feature-map inputs and the right seven values are the corresponding ReLU outputs: negatives become 0.)
0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33 0.33 0 0.11 0 0.11 0 0.33

-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55 0 0.55 0 0.33 0 0.55 0

0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11 0.11 0 0.55 0 0.55 0 0.11

-0.11 0.33 -0.77 1.00 -0.77 0.33 -0.11 0 0.33 0 1.00 0 0.33 0

0.11 -0.55 0.55 -0.77 0.55 -0.55 0.11 0.11 0 0.55 0 0.55 0 0.11

-0.55 0.55 -0.55 0.33 -0.55 0.55 -0.55 0 0.55 0 0.33 0 0.55 0

0.33 -0.55 0.11 -0.11 0.11 -0.55 0.33 0.33 0 0.11 0 0.11 0 0.33

0.33 -0.11 0.55 0.33 0.11 -0.11 0.77 0.33 0 0.55 0.33 0.11 0 0.77

-0.11 0.11 -0.11 0.33 -0.11 1.00 -0.11 0 0.11 0 0.33 0 1.00 0

0.55 -0.11 0.11 -0.33 1.00 -0.11 0.11 0.55 0 0.11 0 1.00 0 0.11

0.33 0.33 -0.33 0.55 -0.33 0.33 0.33 0.33 0.33 0 0.55 0 0.33 0.33

0.11 -0.11 1.00 -0.33 0.11 -0.11 0.55 0.11 0 1.00 0 0.11 0 0.55

-0.11 1.00 -0.11 0.33 -0.11 0.11 -0.11 0 1.00 0 0.33 0 0.11 0

0.77 -0.11 0.11 0.33 0.55 -0.11 0.33 0.77 0 0.11 0.33 0.55 0 0.33

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Pooling
STEPS

• Dimensionality Reduction 1. Pick a window size


• Preserve Spatial Invariance (usually 2 or 3).
2. Pick a stride (usually 2).
The types of pooling operations are: 3. Walk your window across
Max pooling: The maximum pixel value your filtered images.
of the batch is selected. 4. From each window, take
Average pooling: The average value of the maximum value.
all the pixels in the batch is selected.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
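A minimal NumPy sketch of max pooling with a 2x2 window and stride 2, following the steps listed above:

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()    # keep only the largest value in each window
    return out

# Average pooling would use window.mean() instead of window.max().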


Why Pooling ?
• Subsampling pixels will not change the object
bird
bird

Subsampling

We can subsample the pixels to make the image smaller => fewer parameters to characterize the image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Pooling

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Max-Pooling

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Pooling layer
• A stack of images becomes a stack of smaller images.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride
• number of cells the filter is moved to calculate the next output
• sample only every s pixels in each direction in the output

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Stride
• Stride = 2
• First Value:

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride
• Stride = 2
• Next value: -4 (from the element-wise products of the filter and the next input patch, e.g. 8*(-1) + 0*(-1) + 5*(-1) + ...)
• The size of the output feature map may decrease
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Padding
In order to assist the kernel with
processing the image, padding is added
to the frame of the image to allow for
more space for the kernel to cover the
image. Adding padding to an image
processed by a CNN allows for more
accurate analysis of images.
• Use Conv without shrinking the height
and width
• Helpful in building deeper networks
• Keep more of the information at the
border of an image
Zero-Padding

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Same Padding

• Buffers the edge of


the input with
filter_size/2 zeros
(integer division)
• Output dimension is
the same as the input
for s=1
• Output dimension
reduces less for s>1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


These are the network
parameters to be learned.
1 -1 -1
1 0 0 0 0 1 -1 1 -1 Filter 1
0 1 0 0 1 0 -1 -1 1
0 0 1 1 0 0
1 0 0 0 1 0 -1 1 -1
-1 1 -1 Filter 2
0 1 0 0 1 0
0 0 1 0 1 0 -1 1 -1

……
6 x 6 image
Each filter detects a small pattern (3 x 3).
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
stride=1 Dot
product
1 0 0 0 0 1
0 1 0 0 1 0 3 -1
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

6 x 6 image
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
If stride=2

1 0 0 0 0 1
0 1 0 0 1 0 3 -3
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

6 x 6 image
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
stride=1

1 0 0 0 0 1 3 -1 -3 -1
0 1 0 0 1 0
0 0 1 1 0 0 -3 1 0 -3
1 0 0 0 1 0
0 1 0 0 1 0 -3 -3 0 1
0 0 1 0 1 0
3 -2 -2 -1
6 x 6 image
Amity Centre for Artificial Intelligence, Amity University, Noida, India
-1 1 -1
-1 1 -1 Filter 2
-1 1 -1
stride=1
Repeat this for each filter
1 0 0 0 0 1
3 -1 -3 -1
0 1 0 0 1 0 -1 -1 -1 -1
0 0 1 1 0 0 -3 1 0 -3
-1 -1 -2 1
1 0 0 0 1 0 Feature
0 1 0 0 1 0 -3 -3 Map
0 1 Two 3X3 Kernels
-1 -1 -2 1
0 0 1 0 1 0 Forming 4 x 4 x 2 matrix
3 -2 -2 -1
6 x 6 image -1 0 -4 3
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 0 0 0 0 1 1 -1 -1 -1 1 -1
0 1 0 0 1 0 -1 1 -1 -1 1 -1
0 0 1 1 0 0 -1 -1 1 -1 1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0 convolution
image

x1
1 0 0 0 0 1
0 1 0 0 1 0 x2
Fully- 0 0 1 1 0 0

connected 1 0 0 0 1 0




0 1 0 0 1 0
0 0 1 0 1 0
x36
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1 -1 -1 Filter 1 1 1
-1 1 -1 2 0
-1 -1 1 3 0
4: 0 3
1 0 0 0 0 1


0 1 0 0 1 0 7 0
0 0 1 1 0 0 8 1
1 0 0 0 1 0 9: 0
0 1 0 0 1 0 0


0 0 1 0 1 0
13 0 Only connect to
6 x 6 image 9 inputs, not
14 0
fully connected
fewer parameters! 15 1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1 -1 -1 1: 1
-1 1 -1 Filter 1 2: 0
-1 -1 1 3: 0
4: 0 3
1 0 0 0 0 1


0 1 0 0 1 0 7: 0
0 0 1 1 0 0 8: 1
1 0 0 0 1 0 9: 0 -1
0 1 0 0 1 0 10: 0


0 0 1 0 1 0
6 x 6 image 13: 0
Fewer parameters 14: 0
Shared weights
15: 1
16: 1
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Click the link below or copy paste the URL in your browser

https://poloclub.github.io/cnn-explainer/

With the CNN Explainer you can Learn and implement Convolutional Neural
Network (CNN) in your browser! With real sample image dataset

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stacking of Layers

Convolutional Layer -> Activation Function (ReLU) -> Pooling (Max-Pooling)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Multiple Stacking of Layers

-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Conv. Layer -> Acti. Fun. (ReLU) -> Conv. Layer -> Acti. Fun. (ReLU) -> Pooling (Max-Pooling) -> Conv. Layer -> Acti. Fun. (ReLU)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer (Training Phase)

X
O
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer (Testing Phase)
0.9

X
0.65

0.9 0.65

0.45 0.87
0.45

0.87
0.912
0.96

0.96 0.73 0.73

0.23 0.63
0.23

O
0.63
0.44 0.89
0.44
0.94 0.53
0.89

0.94

0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Fully Connected Layer (Testing Phase)
0.9

X
0.65

0.9 0.65 0.45

0.45 0.87 0.87

0.96

0.96 0.73 0.73

0.23 0.63
0.23

O
0.63
0.44 0.89

0.94 0.53
0.44
0.517
0.89

0.94

0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer
0.9

0.65

X
0.45

0.87
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 0.96
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 0.73
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1 0.23

O
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 0.63
-1 -1 -1 -1 -1 -1 -1 -1 -1
0.44

0.89

0.94
Fully Connected
Layer
0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Multiple Stacking of Fully Connected Layers

0.9

X
0.65

0.45

0.87
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1 0.96

-1 -1 1 -1 -1 -1 1 -1 -1
0.73
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1 0.23

O
-1 -1 -1 1 -1 1 -1 -1 -1
0.63
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1
0.44
Fully Fully Fully
0.89 Connected Connected Connected
Layer 1 Layer 2 Layer 3
0.94

0.53

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stacking of Multiple Layers
-1 -1 -1 -1 -1 -1 -1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 -1 -1 1 -1 -1 -1 -1
-1 -1 -1 1 -1 1 -1 -1 -1
-1 -1 1 -1 -1 -1 1 -1 -1
-1 1 -1 -1 -1 -1 -1 1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1

Conv. Layer -> Acti. Fun. (ReLU) -> Conv. Layer -> Acti. Fun. (ReLU) -> Pooling (Max-Pooling) -> Conv. Layer -> Acti. Fun. (ReLU) -> Pooling (Max-Pooling) -> Fully Conn. Layer -> Fully Conn. Layer -> X

Amity Centre for Artificial Intelligence, Amity University, Noida, India
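A minimal tf.keras sketch of this Conv -> ReLU -> Pool -> FC stack for the 9x9 single-channel input used in the example; the filter counts and unit sizes are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(9, 9, 1)),
    layers.Conv2D(4, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(4, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(8, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),    # fully connected layer 1
    layers.Dense(2, activation="softmax"),  # fully connected layer 2: P(X), P(O)
])
model.summary()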


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Convolutional Neural Network- Classification

Classification

Class A

Input Class B
Image
Class C

Class D

• Convolutional layers and pooling help to extract high-level features of the input
• The fully connected layer uses the extracted high-level features to classify the input image into different classes
• The output also includes the class probability of the image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Formulas
(Output Dimensions Calculations)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution operation
if a 𝑚𝑚 ∗ 𝑚𝑚 image convolved
with 𝑛𝑛 ∗ 𝑛𝑛 kernel,
the output image is of
size (𝑚𝑚 − 𝑛𝑛 + 1) ∗ (𝑚𝑚 − 𝑛𝑛 +
1).

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Padding

If an n x n matrix is convolved with an f x f matrix with padding p, then the size of the output image will be (n + 2p - f + 1) x (n + 2p - f + 1), where p = 1 in this case.

Padded image convolved with a 2 x 2 kernel

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride

Stride is the number of pixels the filter shifts over the input matrix.

left image: stride = 0, middle image: stride = 1, right image: stride = 2

For padding p, filter size f x f, input image size n x n and stride s, the output image dimension will be
[floor((n + 2p - f) / s) + 1] x [floor((n + 2p - f) / s) + 1],
where floor(.) (like the Math.floor() static method) always rounds down to the largest integer less than or equal to a given number.

If an image is 100 x 100, the filter is 6 x 6, the padding is 7, and the stride is 4, the result of the convolution will be (100 - 6 + (2)(7)) / 4 + 1 = 28, i.e. a 28 x 28 output.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
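A small Python helper implementing the output-size formula above, checked against the 100 x 100 example:

import math

def conv_output_size(n, f, p, s):
    """Output width/height for input size n, filter size f, padding p, stride s."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(100, 6, 7, 4))   # -> 28, i.e. a 28 x 28 output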


Deep Neural
Networks

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep Learning in 2D image (Recap)
Input Image (INPUT 28x28x3)
-> Conv1: convolution, 8 filters, 3x3, valid padding, stride = 1 (8 channels)
-> Max-Pooling, 2x2, stride = 2 (8 channels)
-> Conv2: convolution, 16 filters, 3x3, valid padding, stride = 1 (16 channels)
-> Max-Pooling, 2x2, stride = 2 (16 channels)
-> Fully-Connected Neural Network, 64 units, ReLU activation
-> Fully-Connected Neural Network, 10 units, Softmax activation
-> Output Class (e.g. "Cat")

Input Image -> Classification Model -> Output Class

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep Learning in 1D Signal
Convolutional layers Flatten layer Fully-Connected Layer

Cat

Input Signal Output Class


Classification Model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution in Time Series Signal

Inverted Kernel Convolution Operation

1 2 .5 0

Time series Signal


Padding = Same
0 1 0 0 0 0 -1 0 0 0 0

Time series Signal Representation

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Convolution in Time Series Signal
Time series Signal
0 0 1 0 0 0 0 -1 0 0 0

Inverted Kernel
Padding = Same
1 2 .5

=
Time series Signal
0 .5 2 1 0 0 -.5 -2 -1 0 0

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution in Time Series Signal
Kernel Kernel size=3

Time series Signal


X X

Padding = Same

Convolution
Inverted Kernel

Result

Amity Centre for Artificial Intelligence, Amity University, Noida, India






Activation Function

0 .5 2 1 0 0 -.5 -2 -1 0 0
Time series Signal

ReLu Activation

0 .5 2 1 0 0 0 0 0 0 0 Result

Amity Centre for Artificial Intelligence, Amity University, Noida, India






Max Pooling
Time series Signal
0 .5 2 1 0 0 0 0 0 0 0

max

Time series Signal


.5 2 0 0 0

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Convolution with 2 kernels
The time-series signal is convolved with two different (inverted) kernels, using padding = same and ReLU activation. Each kernel produces its own feature map, so the output is two feature maps.
Time series Signal -> Feature Maps

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Flatten Layer
The values of the two feature maps are concatenated one after the other into a single one-dimensional vector (the flatten layer), which can then be fed to a fully connected layer.
Time series Signal -> Feature Maps -> Flatten Layer

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Fully Connected Layer

The flattened vector is fed to a fully connected (FC) layer with three output units; in this example the FC outputs are 7, -2 and 0.4.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Activation Function

A sigmoid activation is applied to the fully connected layer outputs: sigmoid(7) ≈ 0.99, sigmoid(-2) ≈ 0.1, sigmoid(0.4) ≈ 0.59.
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Classification

The three activation outputs (0.99, 0.1, 0.59) are the probability values for the 3 classes; the largest value (0.99) assigns the input time-series signal to Class 1.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Time Series Signal Classification using CNN

Input → CNN architecture for a 1-D signal → Output

Amity Centre for Artificial Intelligence, Amity University, Noida, India
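The pipeline above (convolution, ReLU, max pooling, flatten, fully connected layer, class probabilities) can be written directly in Keras. The sketch below is illustrative rather than the exact network from these slides: the signal length, number of kernels and a softmax output (instead of the per-unit sigmoid shown above) are assumptions chosen to mirror the walkthrough.

import tensorflow as tf

# A minimal 1-D CNN for 3-class time-series classification (illustrative sizes)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(11, 1)),                  # 11 time steps, 1 channel
    tf.keras.layers.Conv1D(2, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),        # probabilities for 3 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()

Softmax is the more common choice for mutually exclusive classes; replacing it with a sigmoid per unit reproduces the slide's numbers more literally.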


Deep learning
• Course Code:
Popular CNN architecture.
Convolutional Neural
Networks and Transfer
Learning
Popular CNN Architectures

Amity Centre for Artificial Intelligence, Amity University, Noida, India


ImageNet Dataset
• ImageNet is a dataset of
• over 15 million labelled high-resolution
images
• from ~22,000 categories
• ImageNet Large Scale Visual
Recognition Challenge (ILSVRC):
• Held between 2010 and 2017
• Uses ~1000 categories, each with ~1000
images

Amity Centre for Artificial Intelligence, Amity University, Noida, India


ImageNet Large Scale Visual Recognition Challenge (ILSVRC):

Algorithms that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) from 2010 to 2017. The top-5 error refers to the probability that all top-5 classifications proposed by the algorithm for an image are wrong. The algorithms shown in blue are convolutional neural networks. Although VGGNet took second place in 2014, it is widely used in studies because of its concise structure.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
AlexNet

AlexNet is a pioneering convolutional neural network (CNN) used primarily for image recognition and classification tasks. It won the ImageNet Large Scale Visual Recognition Challenge in 2012, marking a breakthrough in deep learning. AlexNet's architecture, with its innovative use of convolutional layers and rectified linear units (ReLU), laid the foundation for modern deep learning models, advancing computer vision and pattern recognition applications.

AlexNet won the ImageNet large-scale visual recognition challenge in 2012. The model was proposed in the 2012 research paper "ImageNet Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky and his colleagues.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
1. AlexNet has eight layers with learnable parameters.
2. The model consists of five convolutional layers, some followed by max pooling, and three fully connected layers; ReLU activation is used in each of these layers except the output layer.
3. The authors found that using the ReLU activation function accelerated training by almost six times.
4. They also used dropout layers, which prevented the model from overfitting. The model was trained on the ImageNet dataset.
5. The total number of parameters in this architecture is 62.3 million.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


AlexNet
• Input: 227x227x3
• Conv filter sizes: 11x11, 5x5, and three 3x3
• 3 MaxPool layers
• ReLU for hidden units
• Softmax for output
• Flattened feature volume before the fully connected layers: 6x6x256 = 9216
• Dropout (p = 0.5) applied after the first two fully connected layers

P.S.: Output = ((Input - filter size + 2p) / stride) + 1
Amity Centre for Artificial Intelligence, Amity University, Noida, India
AlexNet

P.S.: Output = ((Input - filter size) / stride) + 1 (no padding)


Amity Centre for Artificial Intelligence, Amity University, Noida, India
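As a quick check of the output-size formula above, here is a small, hypothetical helper (not part of the slides) that evaluates it; the example values follow AlexNet's first convolution and the pooling that follows it.

# Output = ((Input - filter_size + 2 * padding) / stride) + 1
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    return (input_size - filter_size + 2 * padding) // stride + 1

# AlexNet's first conv layer: 227x227 input, 11x11 filter, stride 4, no padding
print(conv_output_size(227, 11, padding=0, stride=4))   # 55
# Max pooling that follows: 55x55 input, 3x3 window, stride 2
print(conv_output_size(55, 3, padding=0, stride=2))     # 27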
Summary Highlights in AlexNet
• ReLU activation (avoid vanishing gradient),
• Data Augmentation (avoid overfitting),
• Dropout regularization (avoid co-adaptation)
• Introduced Local Response Normalization (LRN)
• LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local neighbourhood (inter-channel or intra-channel)
• It performs lateral inhibition: the capacity of a neuron to reduce the activity of its neighbours

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Inter-Channel LRN
The neighbourhood is defined across the channels. For each (x,y) position, the normalization is carried out in the depth dimension and is given by

b(x,y)^i = a(x,y)^i / ( k + α · Σ_j ( a(x,y)^j )² )^β ,  with the sum taken over the n channels neighbouring channel i,

where i indicates the output of filter i, a(x,y) and b(x,y) are the pixel values at position (x,y) before and after normalization respectively, and N is the total number of channels. The constants (k, α, β, n) are hyper-parameters: k is used to avoid singularities (division by zero), α is used as a normalization constant, and β is a contrasting constant. The constant n defines the neighbourhood length, i.e. how many consecutive channels are considered while carrying out the normalization. The case (k, α, β, n) = (0, 1, 1, N) is the standard normalization.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Intra-Channel LRN
In intra-channel LRN, the neighbourhood is defined within the same channel only,

where (W, H) are the width and height of the feature map. The only difference between inter- and intra-channel LRN is the neighbourhood used for normalization: in intra-channel LRN, a 2-D neighbourhood is defined around the pixel under consideration, as opposed to the 1-D neighbourhood across channels used in inter-channel LRN.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
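TensorFlow provides a built-in op for inter-channel LRN. The short sketch below is illustrative; the hyper-parameter values are set to those reported for AlexNet (k = 2, n = 5, α = 1e-4, β = 0.75) rather than anything prescribed by these slides.

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 16])          # (batch, height, width, channels)
y = tf.nn.local_response_normalization(
    x,
    depth_radius=2,                          # ~ n/2 neighbouring channels on each side
    bias=2.0,                                # the constant k
    alpha=1e-4,
    beta=0.75,
)
print(y.shape)                               # (1, 8, 8, 16) -- shape is unchanged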


VGGNet
VGG stands for Visual Geometry Group. The VGG architecture is the basis of ground-breaking object recognition models.

Why? VGGNet was born out of the need to reduce the number of parameters in the Conv layers and improve training time.

What? There are multiple variants of VGGNet (VGG16, VGG19, etc.) which differ only in the total number of layers in the network.

• The VGG network is a convolutional neural network model proposed by K. Simonyan and A. Zisserman in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".
• This architecture achieved a top-5 test accuracy of 92.7% on ImageNet, which has over 14 million images belonging to 1000 classes.

It is one of the famous architectures in the deep learning field. Replacing the large 11x11 and 5x5 kernels used in the first and second layers of AlexNet with multiple 3×3 kernels stacked one after another showed an improvement over the AlexNet architecture. It was trained for weeks using NVIDIA Titan Black GPUs.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
VGGNet
• Developed by Visual Geometry Group in 2014
• VGG16 was 2nd in ILSVRC challenge 2014 (top-5 classification error of 7.32%)
• Characterized by Simplicity and Depth
• All Conv layers with 3x3 filters and stride 1, SAME padding
• All max pooling layers use 2x2 filters, stride 2
• VGG16: 16-layer CNN (16 layers with trainable parameters, over 134 million
parameters); VGG19: 19-layer CNN (more than 144 million parameters)
• VGG19: The concept of the VGG19 model (also VGGNet-19) is the same as the
VGG16 except that it supports 19 layers. The “16” and “19” stand for the number
of weight layers in the model (convolutional layers). This means that VGG19 has
three more convolutional layers than VGG16.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


VGGNet

Conv = 3x3 filters, s = 1, same padding; Max pool = 2x2, s = 2 (5 max pooling layers)
ReLU activation in all hidden units; Softmax activation in the output units

Amity Centre for Artificial Intelligence, Amity University, Noida, India



Amity Centre for Artificial Intelligence, Amity University, Noida, India


Architecture of VGG:

Input: The VGGNet takes in an image input size of 224×224. For the ImageNet competition,
the creators of the model cropped out the center 224×224 patch in each image to keep the
input size of the image consistent.
Convolutional Layers:
VGG’s convolutional layers leverage a minimal receptive field, i.e., 3×3, the smallest possible
size that still captures up/down and left/right. Moreover, there are also 1×1 convolution filters
acting as a linear transformation of the input. This is followed by a ReLU unit, which is a huge
innovation from AlexNet that reduces training time.
The convolution stride is fixed at 1 pixel to keep the spatial resolution preserved after
convolution (stride is the number of pixel shifts over the input matrix).
Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not usually
leverage Local Response Normalization (LRN) as it increases memory consumption and
training time. Moreover, it makes no improvements to overall accuracy.
Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the three layers, the first two have 4096 channels each, and the third has 1000 channels, one for each class.
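For reference, the pre-trained VGG16 described above can be loaded directly from Keras Applications. This is a minimal sketch (the ImageNet weights are downloaded on first use and require internet access).

import tensorflow as tf

# 13 convolutional layers + 3 fully connected layers, over 134 million parameters
vgg16 = tf.keras.applications.VGG16(weights="imagenet",
                                    include_top=True,
                                    input_shape=(224, 224, 3))
vgg16.summary()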

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Complexity and challenges
• The number of filters doubles after every stack of convolution layers; this is a major principle used in designing the architecture of the VGG16 network.
• One of the crucial downsides of the VGG16 network is that it is a huge network, which means that it takes more time to train its parameters.
• Because of its depth and number of fully connected layers, the VGG16 model is more than 533 MB. This makes implementing a VGG network a time-consuming task.

Performance of VGG Models

• VGG16 significantly surpasses the previous generation of models from the ILSVRC-2012 and ILSVRC-2013 competitions. Moreover, the VGG16 result is competitive with the classification-task winner (GoogLeNet, 6.7% error) and considerably outperforms the ILSVRC-2013 winning submission Clarifai, which obtained 11.2% with external training data and around 11.7% without it.
• In terms of single-net performance, the VGGNet-16 model achieves the best result with about 7.0% test error, thereby surpassing a single GoogLeNet by around 0.9%.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stacking of Conv
• Multiple stacked Conv layers lead to a wide receptive field
• In VGG, varying filter sizes are implemented by stacking Conv layers with fixed (3x3) filter sizes
• Two stacked 3x3 convolutions have the same effective receptive field as a single 5x5 convolution

Amity Centre for Artificial Intelligence, Amity University, Noida, India


GoogLeNet (Inception v1)
• Developed by Google in 2014 Inception Module

• 1st position in ILSVRC challenge 2014


(top-5 classification error of 6.66%)
• 22-layers with trainable parameters
(27 layers including Max Pool layers)
• Parameters: 5 million (V1), 23 million (V3)
• Contains Inception Modules

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich.
"Going deeper with convolutions." In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 1-9. 2015.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
GoogLeNet (Inception v1)

Inception Module/ Cell Inception Module


• Extract features at different scales from the
input (1x1, 3x3, 5x5)
• Max pooling with "same" padding to
preserve dimensions
• 1x1 Conv to decrease the number of feature
maps (feature-map pooling layer)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
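A minimal sketch of such an inception module in the Keras functional API is shown below; the branch filter counts are illustrative assumptions, not the exact GoogLeNet configuration.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, f_pool=32):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)         # 1x1 branch
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)        # 3x3 branch
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)  # 1x1 reduction
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)        # 5x5 branch
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)               # pooling branch
    b4 = layers.Conv2D(f_pool, 1, padding="same", activation="relu")(b4)    # 1x1 projection
    return layers.Concatenate()([b1, b2, b3, b4])                           # stack feature maps

inp = layers.Input(shape=(28, 28, 192))
out = inception_module(inp)
print(tf.keras.Model(inp, out).output_shape)   # (None, 28, 28, 256)

The 'same' padding on every branch keeps the spatial dimensions equal so the branch outputs can be concatenated along the channel axis.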


Amity Centre for Artificial Intelligence, Amity University, Noida, India
GoogLeNet (Inception v1)

GoogLeNet contains 9 inception modules, a final classifier, and two auxiliary classifiers.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Auxiliary Classifiers
• Intermediate softmax branches placed in the middle of the network
• Only used during training
• Purpose: combating the vanishing gradient problem, regularization
• Their loss is added to the total loss with weight 0.3

Auxiliary classifier structure:
• 5×5 Average Pooling (stride 3)
• 1×1 Conv (128 filters)
• 1024-unit FC layer
• 1000-unit FC layer
• Softmax
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
What's Novel in GoogLeNet?
• Inception module
• 1x1 convolutions
• Global average pooling
• Auxiliary classifiers
• Increased network depth (22 layers)

Their architecture consists of a 22-layer deep CNN, yet it reduced the number of parameters from 60 million (AlexNet) to 4 million.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Glimpse of Backpropagation Algorithm

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Glimpse of Backpropagation Algorithm
• After propagating the input features forward
to the output layer through the various
hidden layers consisting of different/same
activation functions, we come up with a
predicted probability of a sample belonging
to the positive class (generally, for
classification tasks).
• Now, the backpropagation algorithm
propagates backward from the output layer
to the input layer calculating the error
gradients on the way.
• Once the computation for gradients of the
cost function w.r.t each parameter (weights
and biases) in the neural network is done,
the algorithm takes a gradient descent step
towards the minimum to update the value
of each parameter in the network using
these gradients.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Vanishing Gradient
• As the backpropagation
algorithm advances
downwards(or backward)
from the output layer
towards the input layer,
the gradients often get
smaller and smaller and
approach zero which
eventually leaves the
weights of the initial or
lower layers nearly
unchanged.
• As a result, the gradient
descent never converges
to the optimum. This is
known as the vanishing
gradients problem.

"The gradients will be very small for the earlier layers, which means there is no major difference between the new weights and the old weights."

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing Gradient
• The deterioration in the gradient value is proportional to the depth of the network.

• The deeper the network, the higher the chance of obtaining a vanishingly small gradient value towards the end of backpropagation.

• The vanishing gradient problem mainly occurs with the sigmoid and tanh activation functions.
"The gradients will be very small for the earlier layers, which means there is no major difference between the new weights and the old weights."

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing Gradient- Example
• Activation functions such as the sigmoid function have a very prominent difference between the variance of their inputs and outputs.
• They shrink and transform a large input space into a smaller output space, which lies between [0, 1].
• Large inputs, regardless of whether they are negative or positive, are mapped to values very close to 0 or 1, where the function is nearly flat.
• Consequently, during backpropagation there is almost no gradient to propagate backward through the neural network.
• The little gradient that does exist keeps diluting as the algorithm proceeds through the upper layers, leaving almost nothing for the lower layers.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
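The effect is easy to see numerically. The short, illustrative snippet below (not from the slides) evaluates the sigmoid derivative sigma'(x) = sigma(x)(1 - sigma(x)) and shows how quickly it collapses for large inputs.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    print(f"x={x:5.1f}   sigmoid={s:.6f}   gradient={s * (1 - s):.2e}")
# The gradient is 0.25 at x=0 but only ~4.5e-05 at x=10:
# almost nothing is left to propagate back through the earlier layers.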


Exploding Gradient
• On the contrary, in some
cases, the gradients keep
on getting larger and
larger as the
backpropagation
algorithm progresses.

• This, in turn, causes very


large weight updates and
causes the gradient
descent to diverge.

• This is known as
the exploding
gradients problem.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Exploding Gradient
• Similarly, in some cases suppose
the initial weights assigned to the
network generate some large loss.

• Now the gradients can accumulate


during an update and result in very
large gradients which eventually
results in large updates to the
network weights and leads to an
unstable network.

• The parameters can sometimes


become so large that they overflow
and result in NaN values.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing Gradient vs. Exploding Gradient

Amity Centre for Artificial Intelligence, Amity University, Noida, India


How to identify a vanishing or exploding gradients problem?

Vanishing
• Large changes are observed in the parameters of later layers, whereas the parameters of earlier layers change only slightly or stay unchanged
• In some cases, the weights of earlier layers can become 0 as the training goes on
• The model learns slowly, and often training stops after a few iterations
• Model performance is poor

Exploding
• Contrary to the vanishing scenario, exploding gradients show up as unstable, large parameter changes from batch/iteration to batch/iteration
• Model weights can become NaN very quickly
• Model loss also goes to NaN

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Methods to solve the problem of Vanishing/Exploding gradients
• Using a smaller number of layers: A straightforward approach is to use fewer layers in our network, so that the gradient is not multiplied too many times. This may stop the gradient from vanishing or exploding, but it costs us the ability of the network to learn complex features.
• Using the correct activation functions: Saturating functions such as sigmoid saturate for larger inputs and cause the vanishing gradient problem. We can use non-saturating activation functions such as ReLU and its alternatives such as leaky ReLU.
• Using batch normalization: Batch normalization ensures that vanishing/exploding gradients do not appear in between the layers (see the sketch after this list).
• Gradient clipping: A popular method used to solve the exploding gradient problem. It limits the size of the gradients so that they never exceed some specified value (see the sketch after this list).
• Careful weight initialization: We can partially solve both of these problems by carefully choosing the initial model parameters.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
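A hedged Keras sketch of two of these remedies follows: a BatchNormalization layer between hidden layers and gradient clipping configured on the optimizer. The layer sizes and the clipping threshold are illustrative assumptions.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),
    tf.keras.layers.Dense(64),
    tf.keras.layers.BatchNormalization(),        # normalize activations between layers
    tf.keras.layers.Activation("relu"),          # non-saturating activation
    tf.keras.layers.Dense(10, activation="softmax"),
])

# clipnorm rescales any gradient whose L2 norm exceeds 1.0 (combats exploding gradients)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")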


Methods to solve the problem of Vanishing/Exploding gradients

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Methods to solve the problem of Vanishing/Exploding gradients
• Skip Connections
 Skip connections (used in ResNet) prevent the vanishing gradient problem during deep neural network training.
 These connections enable the direct flow of information from earlier layers to later layers, aiding in preserving gradients and promoting better convergence.
 The loss surface of a neural network with skip connections is smoother, leading to faster convergence than the network without any skip connections.

The loss surfaces of ResNet-56 with and without skip connections

Amity Centre for Artificial Intelligence, Amity University, Noida, India


ResNet

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Vanishing/Exploding Gradient:
This is one of the most common problems plaguing the training of larger/deep neural networks, and is a result of oversight in terms of the numerical stability of the network's parameters. During back-propagation, as we keep moving from the deep to the shallow layers, the chain rule of differentiation makes us multiply the gradients. Often these gradients are small, of the order of 10^-5 or less, and as these small numbers keep getting multiplied with each other they become infinitesimally small, making almost negligible changes to the weights. (Vanishing Gradient)

On the other end of the spectrum, there are cases when the gradient reaches orders of up to 10^4 and more. As these large gradients multiply with each other, the values tend to move towards infinity. Allowing such a large range of values in the numerical domain of the weights makes convergence difficult to achieve. (Exploding Gradient)

ResNet, due to its architecture, does not allow these problems to occur at all. The skip connections act as gradient super-highways, allowing the gradient to flow without being altered by a large magnitude.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
The ResNet architecture is considered to be among the most popular
Convolutional Neural Network architectures around. Introduced by
Microsoft Research in 2015, Residual Networks (ResNet in short) broke
several records when it was first introduced.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Residual Block
Residual Block
• Skip connection skips training from a few
layers and connects directly to the output
• Instead of learning the underlying
mapping H(x) from stacked layers, let
network learn the residual F(x) = H(x)-x
• Hence, after adding identity, F(x)+x =
H(x)
• Speeds learning by reducing the impact of vanishing gradients, and avoids degradation
• Enables the development of deeper networks

In mathematical terms, it would mean y = x + F(x), where y is the final output of the layer.
In terms of architecture, if any layer ends up damaging the performance of the model in a plain network, it gets skipped due to the presence of the skip connections.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
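A minimal sketch of such a residual block in the Keras functional API is given below; the filter count, kernel size and use of batch normalization are illustrative assumptions rather than the exact ResNet block.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                          # identity (skip) path
    y = layers.Conv2D(filters, 3, padding="same")(x)      # F(x): two stacked conv layers
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                       # H(x) = F(x) + x
    return layers.Activation("relu")(y)

inp = layers.Input(shape=(32, 32, 64))
out = residual_block(inp)
print(tf.keras.Model(inp, out).output_shape)              # (None, 32, 32, 64)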


ResNet

• Proposed by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

• Addresses the vanishing gradient problem of deep NNs, which grows worse with increasing depth

• 1st position in the ILSVRC challenge 2015 (top-5 classification error of 3.57%)

• ResNet-50 (50 conv layers) has a parameter count of approximately 25.6 million, which makes it a moderately large network compared to earlier architectures.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016, https://arxiv.org/abs/1512.03385.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
What’s Novel in ResNet?

Residual Connection
High Accuracy
Bottleneck layers

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Different versions of the ResNet architecture use a varying number of blocks at different levels.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
What’s Novel in ResNet?

Despite its depth, ResNet-50 uses only about 25.6 million parameters, far fewer than AlexNet's 60 million.
Popularized skip connections (they weren't the first to use skip connections).
Designed even deeper CNNs (up to 152 layers) without compromising the model's generalization power.
Among the first to use batch normalization.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


CNN: Utility of Layers

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture
DenseNet, or Densely Connected Convolutional Network, is a type of convolutional neural network that uses dense connections between layers. DenseNets are feed-forward networks that connect each layer to every other layer; they are used to increase the depth of a convolutional neural network.
DenseNets have several advantages, including:
• Reduced gradient vanishing: DenseNets alleviate the vanishing gradient problem, which makes deep networks difficult to optimize.
• Feature propagation: DenseNets strengthen feature propagation.
• Feature reuse: DenseNets encourage feature reuse.
• Number of parameters: DenseNets substantially reduce the number of parameters.
• Compact input features: DenseNet provides compact and differentiated input features via shortcut connections of different lengths.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture
ResNet performs an element-wise addition to pass the output to the next layer or block. DenseNet connects all layers
directly to each other. It does this through concatenation.
With concatenation, each layer receives collective knowledge from the preceding layers.

Because of these dense connections, the model requires fewer layers, as there is no need to learn redundant feature
maps, allowing the collective knowledge (features learned collectively by the network) to be reused.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
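Below is a minimal sketch of a DenseNet-style dense block in Keras, contrasting concatenation with ResNet's element-wise addition; the growth rate and number of layers are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)  # growth_rate new feature maps
        x = layers.Concatenate()([x, y])                      # concatenate, don't add
    return x

inp = layers.Input(shape=(32, 32, 16))
out = dense_block(inp)
print(tf.keras.Model(inp, out).output_shape)   # (None, 32, 32, 64) = 16 + 4 * 12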


DenseNet Architecture
A DenseNet is a type of convolutional neural network that utilises dense connections between layers,
through Dense Blocks, where we connect all layers (with matching feature-map sizes) directly with each
other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers
and passes on its own feature-maps to all subsequent layers.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


DenseNet Architecture

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Resnet vs. DenseNet Architecture
When comparing DenseNet with ResNet, several key differences stand out:

• Skip Connections: ResNet uses skip connections to implement identity mappings, allowing gradients to
flow through the network without attenuation. DenseNet, on the other hand, uses dense connections,
concatenating feature maps from all preceding layers.
• Memory Usage: DenseNets generally require more memory than ResNets due to the concatenation of
feature maps from all layers. This can be a limiting factor in certain applications.
• Parameter Efficiency: DenseNet is often more parameter-efficient than ResNet. It reuses features
throughout the network, reducing the need to learn redundant feature maps.
• Training Dynamics: DenseNets might have a smoother training process due to the continuous feature
propagation throughout the network. However, this can also lead to increased training time and
computational costs.
• Performance: Both architectures have shown exceptional performance in various tasks. ResNet is often
preferred for very deep networks due to its simplicity and lower computational requirements. DenseNet
shines in scenarios where feature reuse is critical and can afford the additional computational cost.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Popular CNN Architectures: LeNet-5
• Proposed by LeCun et al in 1998
• Applied by several banks to recognise hand-written characters on cheques
digitized to 32x32 pixel greyscale input images
• 5 layers with learnable parameters, 7 layers in total
• 2 set of Conv-Subsampling
• 1 Conv, 1 FC
• 1 Output (10 units)

LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document
recognition." Proc. IEEE 86, no. 11 (1998): 2278-2324.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
LeNet-5
• Input: 32x32x1 greyscale images

• C1: (32 - 5)/1 + 1 = 27 + 1 = 28, giving 28x28 feature maps

• S2: (28 - 2)/2 + 1 = 13 + 1 = 14, giving 14x14 feature maps

• C3: (14 - 5)/1 + 1 = 9 + 1 = 10, giving 10x10 feature maps

• S4: (10 - 2)/2 + 1 = 4 + 1 = 5, giving 5x5 feature maps

• C5: (5 - 5)/1 + 1 = 1, giving 1x1 feature maps

• LeNet-5's architecture has become the standard 'template': stacking convolutions with an activation function and pooling layers, and ending the network with one or more fully-connected layers.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
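A hedged Keras sketch of LeNet-5 follows, useful for checking the layer sizes computed above; average pooling and tanh activations are used here as approximations of the original subsampling and squashing layers, so details differ slightly from the 1998 paper.

import tensorflow as tf
from tensorflow.keras import layers

lenet5 = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, 5, activation="tanh"),      # C1: 28x28x6
    layers.AveragePooling2D(2, strides=2),       # S2: 14x14x6
    layers.Conv2D(16, 5, activation="tanh"),     # C3: 10x10x16
    layers.AveragePooling2D(2, strides=2),       # S4: 5x5x16
    layers.Conv2D(120, 5, activation="tanh"),    # C5: 1x1x120
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),         # F6
    layers.Dense(10, activation="softmax"),      # output: 10 classes
])
lenet5.summary()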
Popular CNN Architectures
• ImageNet Large Scale
Visual Recognition
Challenge (ILSVRC) Winners
• AlexNet (1st, 2012)
• VGGNet (2nd, 2014)
• GoogLeNet (1st, 2014)
• ResNet (1st, 2015)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer Learning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer Learning
.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer learning is a machine learning technique that involves using knowledge gained from one task to improve the performance on a related task.
Or:
Instead of training a model from scratch for a new task, transfer learning allows us to reuse a model pre-trained on a related task and fine-tune it for the new task.

Gif source: https://deepnote.com/@jhon-smith-flores/Transfer-Learning-864f7d51-84f9-4d43-baa0-6194de7943de
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
How TL works in case of Deep Learning Models?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Transfer Learning
• Improvement of learning in a new task through the transfer of
knowledge from a related task that has already been learned.
• Weight initialization for CNN

• Two major strategies


• ConvNet as fixed feature extractor
• Fine-tuning the ConvNet

Amity Centre for Artificial Intelligence, Amity University, Noida, India
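A minimal sketch of the first strategy, using the ConvNet as a fixed feature extractor, is shown below. MobileNetV2, the 160x160 input size and the two-class head are illustrative assumptions, not choices made in these slides.

import tensorflow as tf

# Pre-trained backbone without its ImageNet classification head
base_model = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                               include_top=False,
                                               weights="imagenet")
base_model.trainable = False                       # freeze all pre-trained layers

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # new task-specific output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The second strategy, fine-tuning, starts from this frozen setup and then unfreezes part of the base model, as shown in the fine-tuning code later in this section.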


When to fine-tune your model?

• New dataset is large and similar to the original dataset: fine-tune through some of the last layers
• New dataset is large and very different from the original dataset: fine-tune through some of, or the entire, network

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Steps in Transfer Learning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1. Obtain a pre-trained model
• VGG-16
• VGG-19
• Inception V3
• Xception
• ResNet-50

Amity Centre for Artificial Intelligence, Amity University, Noida, India


2. Create a base model

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3. Freeze layers
• Freezing the starting layers from the pre-trained model is essential to
avoid the additional work of making the model learn the basic
features.
• If we do not freeze the initial layers, we will lose all the learning that
has already taken place. This will be no different from training the
model from scratch and will be a loss of time, resources, etc.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Amity Centre for Artificial Intelligence, Amity University, Noida, India
4. Add new trainable layers

• The only knowledge we are reusing from the base model is the feature extraction layers. We need to
add additional layers on top of them to predict the specialized tasks of the model. These are
generally the final output layers.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
5. Train the new layers
• The pre-trained model’s final output will most likely differ from
the output we want for our model. For example, pre-trained
models trained on the ImageNet dataset will output 1000
classes.
• However, we need our model to work for two classes. In this
case, we have to train the model with a new output layer in
place.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


6. Fine-tune your model
• Fine-tuning can be used to extract more specific features for the new task without training the model from scratch.
• Fine-tuning involves unfreezing some part of the base model and training the entire model again on the whole dataset at a very low learning rate. The low learning rate improves the performance of the model on the new dataset while preventing overfitting.
• In this step, the weights of the top layers of the pre-trained model are trained, which forces the weights to be tuned from generic feature maps to features associated specifically with the dataset.
• The first few layers learn very simple and generic features that generalize to almost all types of images. As you go higher up, the features are increasingly specific to the dataset on which the model was trained.
• The goal of fine-tuning is to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Freeze vs. Fine-Tune
• For this, unfreeze the base_model and set the bottom layers to be un-trainable. Then recompile the model and resume training.

base_model.trainable = True

# Let's take a look to see how many layers are in the base model
print("Number of layers in the base model: ", len(base_model.layers))

# Fine-tune from this layer onwards
fine_tune_at = 100

# Freeze all the layers before the `fine_tune_at` layer
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep learning
• Course Code:
• Unit 3
Convolutional Neural Networks and
Transfer Learning
• Lecture 3
• Parameter sharing, receptive
field, 1D, 2D, 3D convolution,
Convolutional Neural Network
Understanding Receptive Field
Field of view

• The human visual system consists of millions of neurons, where each one captures different information.

• A neuron's receptive field is defined as the patch of the total field of view that the neuron responds to, i.e. what information a single neuron has access to.

Image source: https://www.brainhq.com/brain-resources/brain-connection
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Understanding Receptive field

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Receptive Field in Deep Learning

• Defined as the size of the region in the input that produces a feature; basically, it is a measure of the association of an output feature (of any layer) with a region (patch) of the input.
• The idea of receptive fields applies to local operations (i.e. convolution, pooling).
• A convolutional unit only depends on a local region (patch) of the input.
• That is why the receptive field is never discussed for fully connected layers, since each unit has access to the entire input region.

Effective receptive field: two stacked 3x3 convolutions cover the same 5x5 input region as a single 5x5 convolution.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Receptive Field in Deep Learning

Illustrating the total receptive field and total stride attributes for the L’th layer, which could be seen as the projected
receptive field and stride with respect to the input layer. Together, they capture the overlapping degree of a network.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we care for Receptive Field?

The green and the orange one. Which one would you like to
have in your architecture?
Image Source: https://developer.nvidia.com/blog/image-segmentation-using-digits-5/
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Why do we care for Receptive Field?

Therefore, our goal is to design a convolutional model so that we


ensure that its RF covers the entire relevant input image region.
Image Source: https://developer.nvidia.com/blog/image-segmentation-using-digits-5/
Amity Centre for Artificial Intelligence, Amity University, Noida, India
How to increase receptive field in a convolutional network?
• Add more convolutional layers (make the network deeper)

• Add pooling layers or higher stride convolutions (sub-sampling)


• Use dilated convolutions: a technique that expands the kernel by inserting holes (gaps) between its consecutive elements.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”

Conventional Convolution vs. Dilated convolutions

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”

Convolving a 3 × 3 kernel over a


7 × 7 input with a dilation factor
of 2 (i.e., i = 7, k = 3, d = 2, s = 1
and p = 0).

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”

• Dilated convolutions “inflate”


the kernel by inserting spaces
between the kernel elements.
• The dilation “rate” is
controlled by an additional
hyperparameter d.
• Implementations may vary,
but there are usually d−1
spaces inserted between
kernel elements such that d =
1 corresponds to a regular
convolution

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Dilated convolutions or “atrous convolutions”
To understand the relationship tying the dilation rate d and the output size o, it is useful to think of the impact of d on the effective kernel size. A kernel of size k dilated by a factor d has an effective size

k_eff = k + (k - 1)(d - 1)

For any i, k, p and s, and for a dilation rate d, the output size is

o = floor( (i + 2p - k - (k - 1)(d - 1)) / s ) + 1

Amity Centre for Artificial Intelligence, Amity University, Noida, India
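As a quick, illustrative check of these formulas in Keras: a 3x3 kernel with dilation rate d = 2 has effective size 3 + (3 - 1)(2 - 1) = 5, so on a 7x7 input with no padding and stride 1 the output is (7 - 5)/1 + 1 = 3.

import tensorflow as tf

x = tf.random.normal([1, 7, 7, 1])                       # i = 7
y = tf.keras.layers.Conv2D(filters=1, kernel_size=3,     # k = 3
                           dilation_rate=2,              # d = 2
                           strides=1, padding="valid")(x)
print(y.shape)                                           # (1, 3, 3, 1) -> o = 3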


Parameter Sharing

• Parameter sharing refers to using the same parameter for more than one function in a model
• The kernel is reused (by sliding) when calculating the layer output
• Fewer weights to store and train

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Equivariant Representation

• Parameter sharing causes the layer to have a property called equivariance to translation
• Convolution creates a 2-D map of where certain features appear in the input
• If we move the object in the input, its representation will move by the same amount in the output
• E.g. the same kernel detects an edge wherever the edge occurs in the image

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1D, 2D Convolution

• 2D Convolution
• 2 directions (x, y) to calculate the convolution
• input = (W x H x c), d filters of size (k x k x c), output = (W1 x H1 x d)
• E.g. image data (grayscale or colour)

• 1D Convolution
• 1 direction (time) to calculate the convolution
• input = (time-steps x c), d filters of size (k x c), output = (time-steps1 x d)
• E.g. time-series data, text analysis


Amity Centre for Artificial Intelligence, Amity University, Noida, India
2D, 3D Convolution

• 3D Convolution
• 3 directions (x, y, z) to calculate the convolution
• input = (W x H x L x C), m filters of size (k x k x d), output = (W1 x H1 x L1 x m)
• E.g. MRI data, videos

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3D Convolution: Example

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3D Convolution: Example

Amity Centre for Artificial Intelligence, Amity University, Noida, India


3D Convolution: Example

Amity Centre for Artificial Intelligence, Amity University, Noida, India


1D CNN: input shape = 2D (batch size = None, width = time axis = 7, feature maps/channels = 1)

2D CNN: input shape = 3D (height = 5, width = 7, feature maps/channels = 1)

3D CNN: input shape = 4D (height = 6, width = 6, feature maps/channels = depth = 1)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
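The short, illustrative Keras sketch below shows the per-sample input shapes expected by Conv1D, Conv2D and Conv3D layers (the batch dimension is left as None); the depth of 6 used for the 3-D example is an assumption made for illustration.

import tensorflow as tf
from tensorflow.keras import layers

conv1d = tf.keras.Sequential([layers.Input(shape=(7, 1)),        # (time steps, channels)
                              layers.Conv1D(4, 3, padding="same")])
conv2d = tf.keras.Sequential([layers.Input(shape=(5, 7, 1)),      # (height, width, channels)
                              layers.Conv2D(4, 3, padding="same")])
conv3d = tf.keras.Sequential([layers.Input(shape=(6, 6, 6, 1)),   # (depth, height, width, channels)
                              layers.Conv3D(4, 3, padding="same")])
print(conv1d.output_shape)   # (None, 7, 4)
print(conv2d.output_shape)   # (None, 5, 7, 4)
print(conv3d.output_shape)   # (None, 6, 6, 6, 4)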




• If the input has one channel, such as a grayscale image, then a 3×3 filter will be applied in 3x3x1 blocks.
• If the input image has three channels for red, green, and blue, then a 3×3 filter will be applied in 3x3x3 blocks.
• If the input is a block of feature maps from another convolutional or pooling layer and has a depth of 64, then the 3×3 filter will be applied in 3x3x64 blocks to create the single values that make up the single output feature map.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution with Single Channel and Multiple Filters

• For an input with 1 channel (e.g. a grayscale image), a 3×3 filter is applied in 3x3x1 blocks
• Filters are applied as (k x k x 1); the depth of the output feature map equals the number of filters

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution over Volume (Multiple Channels)

• RGB images has 3


channels:
Red, Green, Blue
• One kernel for every
input channel to the
layer (each kernel is
unique)
• Each filter = a collection
of kernels

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Convolution with Multiple Channels and Multiple Filters

• For an input with 1 channel (e.g. a grayscale image), a 3×3 filter is applied in 3x3x1 blocks
• For an input with 3 channels (e.g. red, green and blue for a colour image), a 3×3 filter is applied in 3x3x3 blocks
• If the input is a block of feature maps from another convolutional layer with a depth of, say, 64, then the 3×3 filter is applied in 3x3x64 blocks to create a single output feature map
• Filters are applied as (k x k x 3); the depth of the output feature map equals the number of filters

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Stride

Stride is the number of pixel shifts over the input matrix.

left image: stride =0, middle image: stride = 1, right image: stride =2

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Padding
In padding, we add layers of zeros around the input image matrix; this layer of zeros is known as padding. (Image shown with padding = 2.)
• Valid: when padding = 'valid', no padding is applied to the image, i.e. no zeros are added to the image.
• Same: when padding = 'same', padding is applied to the image, i.e. zeros are added around the image.

Padding
• It refers to the number of pixels added to an image when it is being processed by a kernel or filter.
• Half padding means padding of half the filter size, and full padding means padding equal to the size of the filter/kernel.
• Padding is done to reduce the loss of data along the sides/boundary of the image.

Padding affects the output image size while filtering in the Conv layer (assumption: stride = 1).
Amity Centre for Artificial Intelligence, Amity University, Noida, India
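A quick, illustrative check of the two padding modes in Keras: with a 3x3 filter and stride 1, 'valid' shrinks a 6x6 input to 4x4, while 'same' pads with zeros and keeps it 6x6.

import tensorflow as tf

x = tf.random.normal([1, 6, 6, 1])
valid = tf.keras.layers.Conv2D(1, 3, padding="valid")(x)
same = tf.keras.layers.Conv2D(1, 3, padding="same")(x)
print(valid.shape)   # (1, 4, 4, 1) -> output shrinks, no zeros added
print(same.shape)    # (1, 6, 6, 1) -> zero padding preserves the size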
Convolution with Multiple Channels and Multiple Filters

• Input feature maps (a x a x b) = (6x6x2); on applying d = 2 filters,
• the output feature map is (c x c x d) = (4x4x2)
• With 2 filters, the output feature map has depth 2

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Output Dimension

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

• Input image, I with dimensions (32x32x3)


• Convolution Layer
• A filter size 3x3
• Stride is 1
• Valid padding, and
• Depth/feature maps are 5 (D =5)
• Output dimensions = ?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

• Input image, I with dimensions (32x32x3)


• Convolution Layer
• A filter size 3x3
• Stride is 1 (s=1)
• Valid padding (p=0), and
• Depth/feature maps are 5 (D =5)
• Output dimensions = 30x30x5, since (32 - 3 + 2x0)/1 + 1 = 30
• After Pooling?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Determine the Output Dimension

• Input to the pooling layer: 30x30x5
• After pooling with filter size f and stride s, each spatial dimension becomes ((30 - f)/s) + 1
• E.g. pooling with filter size 2x2 and stride 2: ((30 - 2)/2) + 1 = 15
• Output dimensions = 15x15x5

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Typical CNN Model

Architecture: INPUT 28x28x1 → Conv1 (8 filters, 3x3, valid padding, stride 1) → Max-Pooling (2x2, stride 2) → Conv2 (16 filters, 3x3, valid padding, stride 1) → Max-Pooling (2x2, stride 2) → Fully-Connected Neural Network (64 units, ReLU activation) → Fully-Connected Neural Network (10 units, Softmax activation)

• Conv1 output: 26x26x8; parameters: 3x3x1x8 + 8 = 80 (a 3x3 filter for 1 channel, 8 such filters and 8 biases)
• Max-Pool output: 13x13x8
• Conv2 output: 11x11x16; parameters: 3x3x8x16 + 16 = 1168 (a 3x3 filter for 8 channels, 16 such filters and 16 biases)
• Max-Pool output: 5x5x16
• FC1 (64 units): input is the flattened 5x5x16 = 400 values, so parameters = (400 + 1) x 64 = 25,664
• FC2 (10 units): parameters = (64 + 1) x 10 = 650

A Keras sketch of this model follows.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
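This sketch reproduces the layer sequence above in Keras (activations placed as described: ReLU in the hidden layers, softmax at the output); model.summary() reports the same parameter counts worked out on the slide (80, 1168, 25,664 and 650).

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, 3, strides=1, padding="valid", activation="relu"),   # 26x26x8,   80 params
    layers.MaxPooling2D(2, strides=2),                                    # 13x13x8
    layers.Conv2D(16, 3, strides=1, padding="valid", activation="relu"),  # 11x11x16, 1168 params
    layers.MaxPooling2D(2, strides=2),                                    # 5x5x16
    layers.Flatten(),                                                     # 400 values
    layers.Dense(64, activation="relu"),                                  # 25,664 params
    layers.Dense(10, activation="softmax"),                               # 650 params
])
model.summary()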


Parameters and Hyperparameters

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Deep Learning
Course Code:

Unit 4: Sequential models &


Recurrent Neural Networks (RNNs)

Lecture 1: Introduction to RNNs


and their applications

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need Sequential Modeling

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Why do we need Sequential Modeling

Given a Football Image

Can you predict where


it will go next?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need Sequential Modeling

Now can you predict?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Why do we need Sequential Modeling

Since we have knowledge about the kicking direction of the player, we can predict the ball's next direction.

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Need for Sequential Modeling

Sequence data | Input data | Output
• Speech recognition: "Wow, it is so nice!"
• Machine translation: "You are my best friend" → "Você é meu melhor amigo"
• Music generation
• Named entity recognition: "GH Hardy said, his contribution was discovery of Ramanujan." → the same sentence with the named entities (GH Hardy, Ramanujan) identified

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Need for Sequential Modeling

Sequence data | Input data | Output
• Sentiment classification: "Wow, it is so nice!"
• DNA sequence analysis
• Video activity recognition: "Fighting"

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Can we use ANN/CNN for Sequential Modeling?

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Can we use ANN/CNN for Sequential Modeling?
No

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Reasons:
 Fixed input size
Example: image size 32x32
 Fixed output size
Example: probabilities of different classes (e.g. Cat, Dog, Rabbit)
 Fixed computational steps
Example: number of layers in the model

Image source: https://medium.com/techiepedia/binary-image-classifier-cnn-using-tensorflow-a3f5d6746697

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Reasons:
Words learnt or approximated at a later position may change the
approximation of a previous word.
Example :
• Blue dresses are looking good.
• Blue dress is looking good.
 Parameter sharing is not done in conventional ANNs.

Here comes the Recurrent Neural Network!


Amity Centre for Artificial Intelligence, Amity University, Noida, India


Reasons for Using Recurrent Neural Network (RNN)
 Can handle inputs and outputs of varying lengths.
 It uses directed cycles to recognize the sequential characteristics of the data.
 Shares parameters across different parts of the network.
 Tracks long-term dependencies.
 Maintains information about order.

Amity Centre for Artificial Intelligence, Amity University, Noida, India
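In contrast to the fixed-size inputs discussed above, an RNN can accept sequences of any length. A minimal Keras sketch (the feature dimension and layer sizes below are illustrative assumptions, not values from these slides):

import tensorflow as tf

# Time dimension declared as None, so sequences of any length are accepted.
inputs = tf.keras.Input(shape=(None, 10))             # 10 features per time step (illustrative)
h = tf.keras.layers.SimpleRNN(32)(inputs)             # 32 hidden units (illustrative)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(h)
model = tf.keras.Model(inputs, outputs)

model(tf.random.normal((4, 7, 10)))                   # batch of length-7 sequences
model(tf.random.normal((4, 15, 10)))                  # same model, length-15 sequences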


When to use RNN?

“Whenever there is a sequence of data and the temporal dynamics that


connects the data is more important than the spatial content of each
individual frame.”

– Lex Fridman (MIT)

Image source: https://commons.wikimedia.org/wiki/File:Lex_Fridman_teaching_at_MIT_in_2018.png

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Neural Network: Simplified

Hidden
Input
Output

Weights
Standard feed-
forward network

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1

x1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2

x1 x2

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3

x1 x2 x3

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn

x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn In general,

n = time step

x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn In general,

n = time step
 Same function is used

x1 x2 x3 xn
 Replicate network any number of times
 Ensure parameter sharing
 Number of timesteps does not matter

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Handling Individual Time Steps

s1 s2 s3 sn

How to maintain the interdependency between inputs?

x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1
A Simple Approach

x1
Let’s consider one approach

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Let’s consider one approach

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Let’s consider one approach
s3

x1 x2 x3

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Let’s consider one approach
s3 s4

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach

x1 x1 x2
Will this approach work?
s3 s4

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach
Problem
 Different function for different time-
step
x1 x1 x2 s1 = f1(x1)
s3 s4
s2 = f2(x1,x2)
s3 = f3(x1,x2,x3) ……

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


s1 s2
A Simple Approach
Problem
 Different function for different time-
step
x1 x1 x2 s1 = f1(x1)
s3 s4
s2 = f2(x1,x2)
s3 = f3(x1,x2,x3) ……
 Depends on input length

x1 x2 x3 x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn
Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn
input

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn output
input

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn Solution
Add recurrent connection

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn output
input past memory

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network

s1 s2 s3 s4 sn snn

hn

h1 h2 h3 h4 hn

x1 x2 x3 x4 xn xnn

Can be represented
more compactly

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network
A sequence of vectors is processed by applying a recurrence formula at each time step.
sn
hn
xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network
A sequence of vectors is processed by applying a recurrence formula at each time step:
hn = fW(hn−1, xn),   with output sn at each step
 Same function fW (same weights) is used at every time step
 Ensures parameter sharing
 Handles the temporal dependency between the elements of the sequence
Amity Centre for Artificial Intelligence, Amity University, Noida, India
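As a rough sketch of this recurrence (the sizes, the tanh activation and the variable names are illustrative assumptions), the same function with the same weights is applied at every time step, for any sequence length:

import numpy as np

def f_W(h_prev, x, W_hh, W_xh, b):
    # One step of the shared recurrence: hn = tanh(Whh·hn-1 + Wxh·xn + b)
    return np.tanh(W_hh @ h_prev + W_xh @ x + b)

rng = np.random.default_rng(0)
hidden, features = 4, 3                          # illustrative sizes
W_hh = rng.normal(size=(hidden, hidden))
W_xh = rng.normal(size=(hidden, features))
b = np.zeros(hidden)

h = np.zeros(hidden)                             # initial state h0
for x in rng.normal(size=(6, features)):         # any number of time steps
    h = f_W(h, x, W_hh, W_xh, b)                 # same function, same weights at every step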
Recurrent Neural Network Architectures
one to one

Vanilla NN

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one one to many

Vanilla NN Image
Captioning

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one one to many many to one

Vanilla NN Image Sentiment


Captioning Classification

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one one to many many to one many to many

Vanilla NN    Image Captioning    Sentiment Classification    Named Entity Recognition

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Recurrent Neural Network Architectures
one to one   → Vanilla NN
one to many  → Image Captioning
many to one  → Sentiment Classification
many to many → Named Entity Recognition
many to many → Machine Translation

Recurrent Neural Network


Amity Centre for Artificial Intelligence, Amity University, Noida, India
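In Keras, these architectures differ mainly in whether the recurrent layer returns only its final state or one output per time step. A hedged sketch, with layer sizes, feature dimension and class count chosen only for illustration:

import tensorflow as tf

# Many to one (e.g. sentiment classification): only the final hidden state is used.
many_to_one = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 100)),
    tf.keras.layers.SimpleRNN(64),                       # return_sequences=False by default
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Many to many (e.g. named entity recognition): one output for every time step.
many_to_many = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 100)),
    tf.keras.layers.SimpleRNN(64, return_sequences=True),
    tf.keras.layers.Dense(10, activation='softmax'),     # per-step class scores
])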
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep Learning
Course Code:

Unit 4: Sequential models &


Recurrent Neural Networks (RNNs)

Lecture 2: Introduction to RNNs


and their applications

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

Source: https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn

Amity Centre for Artificial Intelligence, Amity University, Noida, India




RNN: Forward Propagation

Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

Whh
h0
Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

Whh
h0 h1      h1 = f(Whh·h0 + Wxh·x1 + b)   (f: activation function, b: bias)
Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1
Wsh
Whh
h0 h1
Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1 s2
Wsh Wsh
Whh Whh
h0 h1 h2
Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1 s2 s3
Wsh Wsh Wsh
Whh Whh Whh
h0 h1 h2 h3
Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Forward Propagation

s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Weight matrices Wxh, Whh and Wsh remain the same throughout the forward propagation, thus ensuring parameter sharing
Amity Centre for Artificial Intelligence, Amity University, Noida, India
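A small NumPy sketch of this forward pass is given below; the dimensions are illustrative assumptions, the hidden activation is taken as tanh and the output as sn = Wsh·hn, matching the assumptions used in the BPTT slides that follow:

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, T = 3, 5, 2, 4          # illustrative sizes; T = number of time steps

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_sh = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_h  = np.zeros(n_hidden)

x = rng.normal(size=(T, n_in))                 # x1 ... x4
h = np.zeros(n_hidden)                         # h0
s = []
for t in range(T):
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)  # h1 ... h4, same Wxh and Whh at every step
    s.append(W_sh @ h)                         # s1 ... s4, same Wsh at every step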
RNN: Back Propagation Through Time (BPTT)
Actual outputs
y1 y2 y3 y4
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Back Propagation Through Time (BPTT)
Actual outputs
y1 y2 y3 y4
Loss calculation:  Ln = ℒ(yn, sn),  where ℒ = loss function
In general:  L = Σn Ln
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India
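As a tiny numeric illustration of summing the per-step losses (the values are made up, and the squared-error form follows the least-square assumption used on the next slides):

import numpy as np

y = np.array([0.5, -0.2, 0.9, 0.1])   # actual outputs y1..y4 (illustrative values)
s = np.array([0.4,  0.1, 0.7, 0.0])   # predicted outputs s1..s4 (illustrative values)

L_n = (y - s) ** 2                    # loss at each time step, Ln = (yn - sn)^2
L   = L_n.sum()                       # total loss L = L1 + L2 + L3 + L4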


RNN: Back Propagation Through Time (BPTT)
Loss at each time step
L1 L2 L3 L4        Gradient calculation wrt Wsh:
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Back Propagation Through Time (BPTT)
Loss at each time step
L1 L2 L3 L4
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Gradient calculation wrt Wsh:
∂L4/∂Wsh = ∂L4/∂s4 · ∂s4/∂Wsh

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step
L1 L2 L3 L4
s1 s2 s3 s4
Wsh Wsh Wsh Wsh
Whh Whh Whh Whh
h0 h1 h2 h3 h4
Wxh Wxh Wxh Wxh
x1 x2 x3 x4

Gradient calculation wrt Wsh:
∂L4/∂Wsh = ∂L4/∂s4 · ∂s4/∂Wsh = −2(y4 − s4) · h4

Weight updation wrt Wsh:
Wsh = Wsh − η · ∂L4/∂Wsh   (η = learning rate)

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Whh:
∂L4/∂Whh = ∂L4/∂s4 · ∂s4/∂h4 · ∂h4/∂Whh

Now, h4 = f(z4), where z4 = Whh·h3 + Wxh·x4
Simply, ∂h4/∂Whh = ∂h4/∂z4 · ∂z4/∂Whh
Then, ∂z4/∂Whh = h3 + Whh·∂h3/∂Whh, and ∂h3/∂Whh = ∂h3/∂z3 · (h2 + Whh·∂h2/∂Whh), and so on back to h1

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²

Amity Centre for Artificial Intelligence, Amity University, Noida, India


RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Whh:
Expanding the recursion, ∂L4/∂Whh collects contributions from h3, h2, h1 and h0:
∂L4/∂Whh = ∂L4/∂s4 · ∂s4/∂h4 · Σ(k=1…4) ∂h4/∂hk · ∂hk/∂Whh

Weight updation wrt Whh:
Whh = Whh − η · ∂L4/∂Whh

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Wxh:
∂L4/∂Wxh = ∂L4/∂s4 · ∂s4/∂h4 · ∂h4/∂Wxh

Now, h4 = f(z4), where z4 = Whh·h3 + Wxh·x4
Simply, ∂h4/∂Wxh = ∂h4/∂z4 · ∂z4/∂Wxh
Then, ∂z4/∂Wxh = x4 + Whh·∂h3/∂Wxh, and the recursion continues through z3, z2, z1, since every earlier state also depends on Wxh

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
RNN: Back Propagation Through Time (BPTT)
Loss at each time step: L1, L2, L3, L4
[Unrolled RNN: x1…x4 → h1…h4 (weights Wxh, Whh) → s1…s4 (weight Wsh)]

Gradient calculation wrt Wxh:
∂L4/∂Wxh = ∂L4/∂s4 · ∂s4/∂h4 · ∂h4/∂z4 · (x4 + Whh·∂h3/∂Wxh + …)

Weight updation wrt Wxh:
Wxh = Wxh − η · ∂L4/∂Wxh

In Keras, a simple RNN layer is created with:
tf.keras.layers.SimpleRNN(rnn_units)

Assumptions: sn = Wsh·hn and ℒ = least-square function, i.e. Ln = (yn − sn)²
Amity Centre for Artificial Intelligence, Amity University, Noida, India
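A minimal training sketch built around this layer (the data shapes, unit count and optimizer are illustrative assumptions); Keras unrolls the network and carries out BPTT automatically during fit:

import tensorflow as tf

rnn_units = 64
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),                 # sequences with 8 features per step
    tf.keras.layers.SimpleRNN(rnn_units),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')          # gradients computed by BPTT

x = tf.random.normal((32, 10, 8))                    # 32 sequences of 10 time steps
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=2, verbose=0)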
Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:
Exploding Gradient

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:
Exploding Gradient Vanishing Gradient

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:
Exploding Gradient Vanishing Gradient

 make learning unstable

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Limitations of RNN
 Gradient calculation involves many repeated factors of the weights and of the activation function's derivative.
 This may lead to:

Exploding Gradient:
 Makes learning unstable

Vanishing Gradient:
 Short-term dependencies: “the stars shine in the ?” → sky (an RNN works well here)
 Long-term dependencies: “I grew up in Spain…........…………………… I speak fluent Spanish.” (difficult for an RNN to remember as the gap increases)
Amity Centre for Artificial Intelligence, Amity University, Noida, India
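The effect can be illustrated numerically: the BPTT gradient contains a long product of factors built from Whh, and repeatedly multiplying such factors either drives the gradient towards zero or blows it up. A rough NumPy sketch (the scale values are arbitrary and the activation-derivative factor is omitted for simplicity):

import numpy as np

def gradient_factor_norm(scale, steps=50, size=8, seed=0):
    rng = np.random.default_rng(seed)
    W_hh = scale * rng.normal(size=(size, size)) / np.sqrt(size)
    g = np.eye(size)
    for _ in range(steps):          # product of `steps` repeated Whh factors
        g = W_hh.T @ g
    return np.linalg.norm(g)

print(gradient_factor_norm(0.5))    # shrinks towards 0   -> vanishing gradient
print(gradient_factor_norm(2.0))    # grows very large    -> exploding gradient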
Possible Solutions

Exploding Gradient:
 Gradient clipping

Vanishing Gradient:

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Possible Solutions

Exploding Gradient:
 Gradient clipping

Vanishing Gradient:
 Activation function (ReLU)
 Weight initialization (identity matrix)
 Gated cells (LSTM, GRU, etc.)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
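These remedies map onto standard Keras options; a hedged sketch in which the particular values (clipnorm=1.0, 64 units, and so on) are illustrative assumptions:

import tensorflow as tf

# Exploding gradients: clip the gradient norm (or value) inside the optimizer.
opt = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)    # or clipvalue=...

# Vanishing gradients: e.g. ReLU activation with an identity recurrent initialization.
rnn = tf.keras.layers.SimpleRNN(
    64,
    activation='relu',
    recurrent_initializer='identity',
)

# Gated cells are a further option, e.g. tf.keras.layers.LSTM or tf.keras.layers.GRU.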


Amity Centre for Artificial Intelligence, Amity University, Noida, India
Deep Learning
Course Code:

Unit 4: Sequential models &


Recurrent Neural Networks (RNNs)

Lecture 3: Long Short-Term


Memory (LSTM) networks

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)
 A special kind of RNN, capable of learning long-term dependencies.
 Introduced by Hochreiter & Schmidhuber (1997).
 Keeping relevant information for long periods of time is their default behavior.
 Have been refined and popularized by many researchers.
 Successfully applied in many problems that have sequential behavior.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Selective Read, Selective Write, Selective Forget
– The Whiteboard Analogy

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problems with RNN

 sn holds information of previous time


s1 s2 s3 sn
steps
Wsh Wsh Wsh Wsh

h1 h2 h3 hn

Wxh Wxh Wxh Wxh


x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problems with RNN

 sn holds information of previous time


s1 s2 s3 sn
steps
Wsh Wsh Wsh Wsh

h1 h2 h3 hn  Information stored at time step n-k (for


some k<n) gets completely morphed
Wxh Wxh Wxh Wxh
x1 x2 x3 xn

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Problems with RNN

 sn holds information of previous time


s1 s2 s3 sn
steps
Wsh Wsh Wsh Wsh

h1 h2 h3 hn  Information stored at time step n-k (for


some k<n) gets completely morphed
Wxh Wxh Wxh Wxh
x1 x2 x3 xn  Similar problem when going backwards
(backpropagation)

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

Let us see an analogy for this

Image source:https://prvnk10.medium.com/the-whiteboard-analogy-to-deal-vanishing-and-exploding-gradients-1c0d47bfd6e1

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

 Selectively write

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

 Selectively write

 Selectively read

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Whiteboard Analogy

 Selectively write

 Selectively read

 Selectively forget

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively write
time.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively write
time.

𝑎𝑐 = 17

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively write
time.

𝑎𝑐 = 17
𝑏𝑑 = 50

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively read
time.

𝑎𝑐 = 17
𝑏𝑑 = 50

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively read
time.

𝑎𝑐 = 17
𝑏𝑑 = 50
𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑐 = 17
𝑏𝑑 = 50
𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑐 = 17

𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑐 = 17
𝑎𝑐(𝑏𝑑 + 𝑎) = 884
𝑏𝑑 + 𝑎 = 52

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

𝑎𝑑 + 𝑎𝑐(𝑏𝑑 + 𝑎) = 904
𝑎𝑐(𝑏𝑑 + 𝑎) = 884
𝑎𝑑 = 20

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Derive an Expression on Whiteboard

Compute
Say “board” can have only 3 statements at a  Selectively forget
time.

An RNN has a finite state size.

𝑎𝑑 + 𝑎𝑐(𝑏𝑑 + 𝑎) = 904
𝑎𝑐(𝑏𝑑 + 𝑎) = 884
𝑎𝑑 = 20

Thus, we need selective read, selective write and selective forget!

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Understand the concept using real-time example
+/-

What is the sentiment of the review?

x1 x2 x3 xn
The First ... performance

Review: The first half of the movie was dry but the
second half really picked up pace. The lead actor
delivered an amazing performance.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Understand the concept using real-time example
+/-

What is the sentiment of the review?


 Selectively write
x1 x2 x3 xn
The First ... performance  Selectively read
Review: The first half of the movie was dry but the
second half really picked up pace. The lead actor  Selectively forget
delivered an amazing performance.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Understand the concept using real-time example
+/-

What is the sentiment of the review?


 Selectively write
 Selectively read
 Selectively forget
Helps to store only important information

x1 x2 x3 … xn
The First ... performance
Review: The first half of the movie was dry but the
second half really picked up pace. The lead actor
delivered an amazing performance.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)
 Computational block
 Track information
 Maintain a cell state
 Use gates
[LSTM cell diagram: xn and hn−1 pass through σ and tanh gates to update the cell state Cn−1 → Cn and produce hn and the output sn]

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)

How do LSTMs work?
a) Forget
b) Input
c) Update
d) Output

Source: https://medium.com/analytics-vidhya/tagged/lstm

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)
sn
How do LSTMs work?
a) Forget Cn-1 × +
tanh
b) Input fn in
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Forget gate gets rid of xn


irrelevant information

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Long Short-Term Memory (LSTM)
sn
How do LSTMs work?
a) Forget Cf
Cn-1 × +
tanh
b) Input fn in
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Forget gate gets rid of xn


irrelevant information

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 ×
Cf
+
b) Input in tanh
×
gn ×
c) Update σ σ tanh σ
d) Output hn-1

Input gate stores relevant xn


information from current
input
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 ×
Cf
+
in Ci tanh
b) Input ×
gn ×
c) Update σ σ tanh σ
d) Output hn-1

Input gate stores relevant xn


information from current
input
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 × Cf + Cn
b) Input Ci tanh
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Update gate selectively xn


updates the cell state value

Amity Centre for Artificial Intelligence, Amity University, Noida, India




Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 × + Cn
tanh
b) Input on
× ×
c) Update
σ σ tanh σ
d) Output hn-1

Output gate returns a xn


filtered version of the
cell state
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Long Short-Term Memory (LSTM)

sn
How do LSTMs work?
a) Forget Cn-1 × + Cn
tanh
b) Input on
× ×
c) Update
σ σ tanh σ
d) Output hn-1 hn

Output gate returns a xn


filtered version of the
cell state
Amity Centre for Artificial Intelligence, Amity University, Noida, India
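A compact NumPy sketch of one LSTM cell step, consistent with the gate structure shown above; the weight shapes, the concatenated [hn−1; xn] input and the variable names are illustrative assumptions rather than a fixed specification:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W: dict of weight matrices acting on [h_prev; x], b: dict of biases (illustrative names)
    z = np.concatenate([h_prev, x])
    f = sigmoid(W['f'] @ z + b['f'])          # forget gate: what to erase from c_prev
    i = sigmoid(W['i'] @ z + b['i'])          # input gate: what to write
    g = np.tanh(W['g'] @ z + b['g'])          # candidate values
    c = f * c_prev + i * g                    # update: new cell state Cn
    o = sigmoid(W['o'] @ z + b['o'])          # output gate: filtered view of the cell
    h = o * np.tanh(c)                        # hidden state hn
    return h, c

rng = np.random.default_rng(0)
n_h, n_x = 4, 3                               # illustrative sizes
W = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_x)) for k in 'figo'}
b = {k: np.zeros(n_h) for k in 'figo'}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):           # run the cell over a short sequence
    h, c = lstm_step(x, h, c, W, b)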
LSTM Gradient Flow
s1 s2 s3

× + × + × +
tanh tanh tanh C3
C0 fn fn fn
× × × × × ×
σ σ tanh σ σ σ tanh σ σ σ tanh σ

x1 x2 x3

Uninterrupted gradient flow

Amity Centre for Artificial Intelligence, Amity University, Noida, India


LSTM Gradient Flow

BPTT in LSTM is similar to BPTT in RNN.


The complexity of the derivatives increases due to the presence of the gates.
Detailed information on BPTT of LSTM can be found at
https://kartik2112.medium.com/lstm-back-propagation-behind-the-scenes-andrew-
ng-style-notations-7207b8606cb2
tf.keras.layers.LSTM(lstm_units)

Amity Centre for Artificial Intelligence, Amity University, Noida, India
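For example, a many-to-one LSTM model for the movie-review sentiment task sketched earlier; the vocabulary size, embedding width and unit count are illustrative assumptions:

import tensorflow as tf

lstm_units = 128                                                  # illustrative
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                                # integer word indices, any length
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),    # assumed 10k-word vocabulary
    tf.keras.layers.LSTM(lstm_units),
    tf.keras.layers.Dense(1, activation='sigmoid'),               # e.g. positive/negative review
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])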


Amity Centre for Artificial Intelligence, Amity University, Noida, India
