
Deep Learning

Introduction to Deep Learning

By

Dr. Kumud Tripathi


Grading:
Theory - 150 Marks
● Mid Sem - 40
● End Sem - 70
● Quiz - 20
● Presentation - 20

Lab - 50 Marks
● Assignments - 20
● Group Project - 30
  ○ Project work - 20
  ○ Individual Presentation - 10

Note: 3-4 students per group


Content:
1. Introduction
a. What is Deep Learning?
b. Why Deep Learning?
c. Fields where Deep Learning is used
d. Difference between Deep Learning and Machine Learning
2. Overview of supervised, unsupervised, reinforcement learning
3. Difference between classification, regression
What is Deep Learning?
Deep learning is a collection of statistical machine learning techniques for learning feature hierarchies, based on artificial neural networks.

Image Source: Google


Why Deep Learning?

1. Deep Learning outperforms other techniques when the data size is large. With small data sizes, traditional Machine Learning algorithms are preferable.
2. Deep Learning techniques need high-end infrastructure to train in reasonable time.
3. When there is a lack of domain understanding for feature introspection, Deep Learning techniques outshine others because you have to worry less about feature engineering.
4. Deep Learning really shines on complex problems such as image classification, natural language processing, and speech recognition.
Deep Learning Applications

1. Self-Driving Cars
2. Voice Controlled Assistance
3. Automatic Image Caption Generation
4. Automatic Machine Translation
Deep Learning Vs Machine Learning

Image Source: Kaggle


Supervised, unsupervised, and reinforcement learning

Criteria: Supervised ML | Unsupervised ML | Reinforcement ML
Definition: Learns by using labelled data | Trained using unlabelled data without any guidance | Learns by interacting with the environment
Type of data: Labelled data | Unlabelled data | No predefined data
Type of problems: Regression and classification | Clustering | Exploitation or exploration
Supervision: Extra supervision | No supervision | No supervision
Algorithms: Linear Regression, Logistic Regression, SVM, KNN, etc. | K-Means, C-Means, Apriori | Q-Learning, SARSA
Aim: Calculate outcomes | Discover underlying patterns | Learn a series of actions
Application: Risk evaluation, forecasting sales | Recommendation systems, anomaly detection | Self-driving cars, gaming, healthcare
Classification Vs Regression

Parameter: CLASSIFICATION | REGRESSION
Basic: The mapping function maps values to predefined classes. | The mapping function maps values to a continuous output.
Involves prediction of: Discrete values | Continuous values
Nature of the predicted data: Unordered | Ordered
Method of calculation: By measuring accuracy | By measuring root mean square error
Example algorithms: Decision tree, logistic regression, etc. | Random forest, linear regression, etc.
Deep Learning

Perceptron

By

Dr. Kumud Tripathi


Content:
1. Biological Neuron and Artificial Neuron
2. Perceptron and its type
Biological and Artificial Neurons
Artificial Neuron Characteristics
● A neuron is a mathematical function modeled on the working of biological neurons
● It is an elementary unit in an artificial neural network
● One or more inputs are separately weighted
● Inputs are summed and passed through a nonlinear function to produce output
● Every neuron holds an internal state called activation signal
● Each connection link carries information about the input signal
● Every neuron is connected to another neuron via connection link
Perceptron
● A Perceptron is an algorithm for supervised learning of binary classifiers. It enables a neuron to learn by processing the elements of the training set one at a time.
Types of Perceptron
1. Single layer: A single-layer perceptron can learn only linearly separable patterns.
2. Multilayer: A multilayer perceptron has two or more layers and greater processing power.

Note: The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary.
How Does Perceptron Work?
● Weights show the strength of the particular node.
● A bias value allows you to shift the activation function curve up or down.

1. The weights are initialized with random values at the start of each training run.
2. Multiply all input values by the corresponding weight values, then add them to calculate the weighted sum:
a. ∑wi*xi = x1*w1 + x2*w2 + x3*w3 + … + xn*wn
3. An activation function is applied to the above-mentioned weighted sum, giving us an output in binary form:
a. Y = f(∑wi*xi + b)
4. For each element of the training set, the error is calculated as the difference between the desired output and the actual output. The calculated error is used to adjust the weights.
5. The process is repeated until the error made on the entire training set is less than the specified limit, or until the maximum number of iterations has been reached.
Perceptron Algorithm Training Procedure
1. Initialize our weight vector w with small random values
2. Until the Perceptron converges:
a. Loop over each feature vector xj and true class label dj in our training set D
b. Take xj and pass it through the network, calculating the output value: yj = f(w(t) · xj)
c. Update the weights w: wi(t+1) = wi(t) + α(dj − yj)xj,i for all features 0 <= i <= n
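As a minimal sketch of this procedure (the variable names and learning rate are illustrative, not from the slides):

```python
import numpy as np

def train_perceptron(X, d, lr=0.1, epochs=100):
    """Train a single perceptron with a step activation.
    X: (m, n) feature matrix; d: (m,) desired labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # w[0] acts as the bias
    w = np.random.uniform(-0.5, 0.5, Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xj, dj in zip(Xb, d):
            yj = 1 if np.dot(w, xj) > 0 else 0  # step activation f(w(t) . xj)
            w += lr * (dj - yj) * xj            # wi(t+1) = wi(t) + α(dj − yj)xj,i
            errors += int(dj != yj)
        if errors == 0:  # converged on the training set
            break
    return w

# Example: learn the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```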
Activation Function of Perceptron Model
● Activation functions are used to map the input between the required values like (0, 1) or (-1, 1).
Limitation of Perceptron Model
1. The output of a perceptron can only be a binary value (0 or 1) due to the hard-limit transfer function.
2. It can only classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it cannot classify them correctly.
Implementing Basic Logic Gates With
Perceptron
1. AND

If the two inputs are TRUE (+1), the output of Perceptron is positive, which amounts to TRUE.

This is the desired behavior of an AND gate.

x1 = 1 (TRUE), x2 = 1 (TRUE)

w0 = -0.8, w1 = 0.5, w2 = 0.5

o(x1, x2) = -0.8 + 0.5*1 + 0.5*1 = 0.2 > 0 => TRUE


Implementing Basic Logic Gates With
Perceptron
2. OR

If either of the two inputs are TRUE (+1), the output of Perceptron is positive, which amounts to TRUE.

This is the desired behavior of an OR gate.

x1 = 1 (TRUE), x2 = 0 (FALSE)

w0 = -0.3, w1 = 0.5, w2 = 0.5

o(x1, x2) = -0.3 + 0.5*1 + 0.5*0 = 0.2 > 0 => TRUE
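The two gate computations above can be checked with a few lines of code; the helper name below is illustrative:

```python
def perceptron_gate(x1, x2, w0, w1, w2):
    """o(x1, x2) = 1 if w0 + w1*x1 + w2*x2 > 0 else 0 (step activation)."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print("AND", x1, x2, "->", perceptron_gate(x1, x2, -0.8, 0.5, 0.5))
        print("OR ", x1, x2, "->", perceptron_gate(x1, x2, -0.3, 0.5, 0.5))
```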


Implementing Basic Logic Gates With
Perceptron

3. XOR

XOR is not linearly separable, so a single-layer perceptron cannot implement it; this motivates the multilayer perceptron.
Deep Learning

Multilayer Perceptron

By

Dr. Kumud Tripathi


Content:
1. Example XOR
2. MLP
3. Backpropagation
Example of XOR Gate
● Y = (A ⊕ B)
● Y = (A'B + AB')
● Y = (A'B + AB') + (A'A + B'B)
● Y = (A'A + AB') + (A'B + B'B)
● Y = (A' + B')(A + B)
● Y = (AB)'(A + B)
Example of XOR Gate
MLP

● Composed of several Perceptron-like units arranged in multiple layers
● Consists of an input layer, one or more hidden layers, and an output layer
● Nodes in the hidden layers compute a nonlinear transform of the inputs
● Also called a Feedforward Neural Network
What do Hidden Layers Learn?

● Hidden layers can automatically extract features from data
● The bottom-most hidden layer captures very low-level features (e.g., edges). Subsequent hidden layers learn progressively more high-level features (e.g., parts of objects) that are composed of the previous layer's features
Steps for Implementing MLP
1. Feedforward:

In a feedforward neural network, we have a set of input features and some random weights. Notice that in this case, we take random weights that we will then optimize using backpropagation.

2. Backpropagation:

Backpropagation is an algorithm for updating the weights and biases of a model based on their gradients with respect to the error function, starting from the output layer all the way back to the first layer.

● Activation functions should be differentiable, so that a network's parameters can be updated using backpropagation.
Gradient Descent

● Gradient:
○ A gradient measures how much the output of a function changes if you change the inputs a little bit.
○ In machine learning, a gradient is the derivative of a function, also known as the slope of a function in mathematical terms.
● Gradient Descent:
○ Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.
○ The main objective of using a gradient descent algorithm is to minimize the cost function iteratively.
● The cost function is defined as the measurement of the difference, or error, between actual values and predicted values.
How Does Gradient Descent Work?

● The equation b = a − γ∇f(a) describes what gradient descent does:
○ b is the updated weight, while a represents the current weight.
○ The minus sign refers to the minimization part of gradient descent.
○ γ (gamma) is the learning rate, and
○ the gradient term ∇f(a) gives the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.
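A minimal sketch of this update rule on a one-dimensional function f(a) = a², whose gradient 2a is known in closed form (the step count and learning rate are illustrative):

```python
def gradient_descent(grad_f, a0, gamma=0.1, steps=50):
    a = a0
    for _ in range(steps):
        a = a - gamma * grad_f(a)  # b = a - gamma * grad f(a)
    return a

# f(a) = a^2 has gradient 2a and its minimum at a = 0
print(gradient_descent(lambda a: 2 * a, a0=5.0))  # approaches 0
```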
Importance of the Learning Rate

● It is defined as the step size taken to reach the minimum or lowest point.
● This is typically a small value that is updated based on the behavior of the cost function.
● If the learning rate is high, it results in larger steps but risks overshooting the minimum.
● A low learning rate means small step sizes, which compromises overall efficiency but gives the advantage of more precision.
Deep Learning

Multilayer Perceptron Training

By

Dr. Kumud Tripathi


Backpropagation Algorithm
Forward Pass

● To find the value of H1 we first multiply the input values by the weights:

H1 = x1×w1 + x2×w2 + b1

H1 = 0.05×0.15 + 0.10×0.20 + 0.35

H1 = 0.3775

● To calculate the final output of H1, we apply the sigmoid function: out_H1 = σ(0.3775) = 0.593269992

Forward Pass

● We calculate the value of H2 in the same way as H1:

H2 = x1×w3 + x2×w4 + b1

H2 = 0.05×0.25 + 0.10×0.30 + 0.35

H2 = 0.3925

● To calculate the final output of H2, we apply the sigmoid function: out_H2 = σ(0.3925) = 0.596884378

Forward Pass

● Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2:

y1 = out_H1×w5 + out_H2×w6 + b2

y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60

y1 = 1.10590597

● Now Y1 (final) = σ(1.10590597) = 0.75136507
● Similarly, y2 = 1.2249214 and Y2 (final) = σ(1.2249214) = 0.772928465

Forward Pass

● Total Error: the error for each output is ½(target − output)², and the total error is their sum:

E_total = ½(T1 − Y1)² + ½(T2 − Y2)²

● Now, we will backpropagate this error to update the weights using a backward pass.
Backward Pass at the Output Layer

● To update a weight, we calculate the error corresponding to that weight with the help of the total error.
● The error on weight w is calculated by differentiating the total error with respect to w: ∂E_total/∂w.
● We perform the backward process, so we first consider the last-layer weight w5. By the chain rule:

∂E_total/∂w5 = (∂E_total/∂Y1) × (∂Y1/∂net_y1) × (∂net_y1/∂w5)
Backward Pass at the Output Layer

● Now, we calculate each term one by one to differentiate E_total with respect to w5.
● Then we put the values of these terms into the chain-rule expression above to find the final result.
● Finally, we calculate the updated weight w5_new = w5 − η × ∂E_total/∂w5, where η is the learning rate.
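The whole worked example can be reproduced numerically. The sketch below assumes the values of this standard example that are hidden in the slide figures (w7 = 0.50, w8 = 0.55, targets T1 = 0.01 and T2 = 0.99, learning rate 0.5); treat them as assumptions rather than part of the slides:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Inputs, weights and biases from the slides
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6 = 0.40, 0.45
w7, w8 = 0.50, 0.55        # assumed (hidden in the slide figures)
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99        # assumed targets of the standard worked example
lr = 0.5                   # assumed learning rate

# Forward pass
H1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # 0.593269992
H2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # 0.596884378
Y1 = sigmoid(H1 * w5 + H2 * w6 + b2)   # sigmoid(1.10590597) = 0.75136507
Y2 = sigmoid(H1 * w7 + H2 * w8 + b2)   # sigmoid(1.2249214)  = 0.772928465
E_total = 0.5 * (T1 - Y1) ** 2 + 0.5 * (T2 - Y2) ** 2

# Backward pass for w5: dE/dw5 = dE/dY1 * dY1/dnet * dnet/dw5
dE_dw5 = (Y1 - T1) * Y1 * (1 - Y1) * H1
w5_new = w5 - lr * dE_dw5              # about 0.35891648
print(E_total, w5_new)
```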
Deep Learning

Different activation functions, their advantages and disadvantages

By

Dr. Kumud Tripathi


Activation Function

● Why do we need activation functions?

● Why can't we pass this signal to the output without activating it?

● Why are nonlinear functions needed?


Activation Function

● Why do we need activation functions?
○ To bring non-linearities into the decision boundary, so that artificial neural networks can model nonlinear real-world properties.

● Why can't we pass this signal to the output without activating it?
○ Without an activation function, the output signal becomes a simple linear function.

● Why are nonlinear functions needed?
○ Without a non-linear activation function in your NN, no matter how many layers it has, it will behave just like a single-layer perceptron, because composing linear layers gives you just another linear function.
Types of Activation Function

1. Step Function
2. Sigmoid Function
3. Tanh Function
4. ReLU Function
5. Leaky ReLU Function
6. Softmax Function
Step Function

● It is a function that outputs a binary value and is used as a binary classifier.
● Therefore, it is generally preferred in output layers.
● It is not recommended in hidden layers because its derivative is zero everywhere (except at the step), so it provides no gradient for learning.
Sigmoid Function
● Sigmoid transforms values to the range between 0 and 1.
● The mathematical form of the sigmoid function is: σ(x) = 1 / (1 + e^(−x))
● The derivative of the sigmoid is: σ'(x) = σ(x)(1 − σ(x))

Sigmoid Function

Advantages:
● The output value is between 0 and 1.
● Prediction is simple, i.e., based on a threshold probability value.

Disadvantages:
● Computationally expensive
● Outputs are not zero-centered
● Vanishing gradient: for very high or very low values of x, there is almost no change to the prediction, causing a vanishing gradient problem. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.
Hyperbolic Tangent (tanh) Function
● The tanh function is similar to the sigmoid function. The output ranges from -1 to 1.
● The mathematical form of the tanh function is: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
● The derivative of the tanh function is: tanh'(x) = 1 − tanh²(x)

Hyperbolic Tangent (tanh) Function

Advantages:
● Zero-centered
● Prediction is simple, i.e., based on a threshold value.

Disadvantages:
● More computationally expensive than the sigmoid function.
● Suffers from vanishing gradients.
Rectified Linear Unit (ReLU) Function
● The rectified linear activation function (ReLU) is a piecewise linear function: if the input x is positive, the output is x; otherwise, it outputs zero.
● The mathematical representation of the ReLU function is: f(x) = max(0, x)
● The derivative of ReLU is: f'(x) = 1 if x > 0, else 0
● Note: If the input is less than 0, the output and gradient are zero, and the neural network cannot continue the backpropagation algorithm through that unit. This problem is commonly known as Dying ReLU. To get rid of this problem we use an improved version of ReLU, called Leaky ReLU.
Rectified Linear Unit (ReLU) Function

Advantages:
● No vanishing gradient for positive inputs
● Derivative is constant
● Computationally efficient

Disadvantages:
● Dying ReLU problem, i.e., for inputs that are 0 or negative the gradient of ReLU becomes zero and thus the network cannot backpropagate through those units.
Leaky Rectified Linear Unit (Leaky ReLU) Function
● Leaky ReLU is the most common and effective method to solve the dying ReLU problem.
● It is nothing but an improved version of the ReLU function.
● It adds a slight slope in the negative range to prevent the dying ReLU issue.
● The mathematical representation of Leaky ReLU is: f(x) = x if x > 0, else 0.01x
● The derivative of Leaky ReLU is: f'(x) = 1 if x > 0, else 0.01

Leaky Rectified Linear Unit (Leaky ReLU) Function

Advantages:
● Modification of the ReLU function to solve the dying ReLU problem.

Disadvantages:
● Leaky ReLU does not provide consistent predictions for negative input values.
Softmax Function

● The softmax function is often described as a combination of multiple sigmoids.
● The softmax function can be used for multiclass classification problems.
● The probabilities sum to 1.
● It is commonly found in the output layer.
● This function returns the probability of a datapoint belonging to each individual class.
● Here is the mathematical expression of the same: softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
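A compact sketch of the activation functions listed above (the max-subtraction in softmax is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), np.tanh(z), relu(z), leaky_relu(z))
print(softmax(z), softmax(z).sum())  # probabilities summing to 1
```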
Deep Learning

Loss and Cost Functions

By

Dr. Kumud Tripathi


Difference between Loss and Cost Functions

● The loss/cost function in machine learning helps us understand the difference between the predicted value and the actual value.
● The loss function is associated with every training example, while the cost function is the average value of the loss function over all the training samples.
● In machine learning, we usually optimize the cost function rather than the loss function.
Types of Cost Functions

Cost functions can be of various types depending on the problem. However, there are mainly two types:

1. Regression Cost Functions
2. Classification Cost Functions
Regression Cost Functions

● Regression models deal with predicting a continuous value, for example the salary of an employee, the price of a car, loan prediction, etc.
● They are calculated from a distance-based error:

Error = y − y'

● Where,

y – actual output

y' – predicted output
Regression Cost Functions
● The most used Regression cost functions are below,
a. Mean Squared Error
b. Mean Absolute Error
Mean Squared Error (MSE)
● It is measured as the average of the squared differences between predictions and actual observations:

MSE = (1/n) Σ_{i=1..n} (y_i − y'_i)²

● Here the square of the difference between the actual and predicted value is calculated to avoid any possibility of negative error.
● It is also known as L2 loss.
● In MSE, since each error is squared, even small deviations in prediction are penalized, compared to MAE.
● But if our dataset has outliers that contribute large prediction errors, squaring magnifies those errors many times over and leads to a much higher MSE.
● Hence we can say that it is less robust to outliers.
Mean Squared Error (MSE)
(a) Without Outlier (b) With Outlier
Mean Absolute Error (MAE)
● MAE is measured as the average of the absolute differences between predictions and actual observations:

MAE = (1/n) Σ_{i=1..n} |y_i − y'_i|

● Here the absolute difference between the actual and predicted value is calculated to avoid any possibility of negative error.
● It is also known as L1 loss.
● It is robust to outliers, so it gives better results even when our dataset has noise or outliers.
Mean Absolute Error (MAE)
(a) Without Outlier (b) With Outlier
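A short sketch contrasting the two losses; the toy data with one outlier-like point are illustrative:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)   # L2 loss: squaring magnifies large errors

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))  # L1 loss: more robust to outliers

y     = np.array([3.0, 5.0, 2.5, 100.0])  # the last point acts like an outlier
y_hat = np.array([2.5, 5.0, 3.0, 10.0])
print(mse(y, y_hat), mae(y, y_hat))  # MSE is inflated far more by the outlier
```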
Classification Cost Functions
● The most used classification cost functions are below,
a. Cross Entropy Loss
b. KL Divergence Loss
c. Hinge Loss
Cross Entropy Loss
● Entropy:
○ Entropy signifies uncertainty.
○ The greater the value of entropy H(X), the greater the uncertainty of the probability distribution; the smaller the value, the less the uncertainty.
○ If the entropy is higher, the surety about the probability distribution is lower; when the entropy is lower, the surety is higher.
○ For a random variable X with probability distribution p(X), entropy is defined as:

H(X) = −Σ_x p(x) log p(x)
Cross Entropy Loss
● Cross Entropy:
○ Also called logarithmic loss, log loss, or logistic loss.
○ Each predicted class probability is compared to the actual desired class output (0 or 1), and a loss is calculated that penalizes the probability based on how far it is from the actual expected value.
○ The penalty is logarithmic in nature, yielding a large score for large differences close to 1 and a small score for small differences tending to 0.
○ A perfect model has a cross-entropy loss of 0.
○ Cross-entropy is defined as:

CE(p, q) = −Σ_i p_i log(q_i)
Cross Entropy Loss
● Example:
○ The cross-entropy between a one-hot target and a predicted distribution is computed as in the sketch below.
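A minimal sketch, with illustrative distributions:

```python
import numpy as np

def cross_entropy(p_true, p_pred, eps=1e-12):
    """CE(p, q) = -sum_i p_i * log(q_i); eps avoids log(0)."""
    return -np.sum(p_true * np.log(p_pred + eps))

y = np.array([0.0, 1.0, 0.0])        # one-hot true class
good = np.array([0.05, 0.90, 0.05])  # confident, correct prediction
bad  = np.array([0.60, 0.20, 0.20])  # probability far from the target
print(cross_entropy(y, good), cross_entropy(y, bad))  # small vs. large loss
```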
KL Divergence Loss
● The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much one probability distribution differs from another probability distribution.
● The lower the KL divergence value, the better we have matched the true distribution with our approximation.
● The KL divergence is not symmetric: KL(P‖Q) ≠ KL(Q‖P). As a result, it is also not a distance metric.

KL Divergence Loss
● It represents the difference between cross entropy and entropy: KL(P‖Q) = H(P, Q) − H(P).
Hinge Loss
● This special loss function is used with Support Vector Machines or Maximal Margin Classifiers, whose classes are -1 and 1 (not 0 and 1): L = max(0, 1 − y·f(x)).
● SVM is a machine learning algorithm used especially for binary classification; it uses decision boundaries to separate two classes.
Deep Learning

Dataset Splitting, Bias vs Variance trade-off

By

Dr. Kumud Tripathi


Data Splitting
● In most supervised machine learning tasks, best practice recommends splitting your data into three independent sets:
a. a training set,
b. a testing set, and
c. a validation set.
Training Dataset

● The sample of data used to fit the model.
● The actual dataset that we use to train the model (the weights and biases, in the case of a neural network).
● The model sees and learns from this data.
Validation Dataset

● The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
● We use this data to fine-tune the model hyperparameters.
● Hence the model occasionally sees this data, but never "learns" from it.
● We use the validation set results to update higher-level hyperparameters.
● The validation set is also known as the Dev set or the Development set.
Test Dataset

● The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
● It contains carefully sampled data spanning the various classes that the model would face when used in the real world.
Dataset Split Ratio

● This mainly depends on two things:
○ first, the total number of samples in your data, and
○ second, the actual model you are training.
● Some models need substantial data to train on, so in this case you would optimize for a larger training set.
● Models with very few hyperparameters are easy to validate and tune, so you can probably reduce the size of your validation set; but if your model has many hyperparameters, you will want a large validation set as well.
● Also, if you have a model with no hyperparameters, or ones that cannot be easily tuned, you probably don't need a validation set either.
Bias vs Variance trade-off

● Let there be n training points and m test (validation) points.
● As the model complexity increases, the training error becomes overly optimistic and gives us a wrong picture of how close f̂ is to f.
● The validation error gives the real picture of how close f̂ is to f.
Deep Learning

Regularization: Data Augmentation, Early Stopping

By

Dr. Kumud Tripathi


Data Augmentation
● This is used to prevent overfitting.
● Data augmentation is a set of techniques to artificially increase the amount of data by generating
new data points from existing data.
● This includes making small changes to data or using deep learning models to generate new data
points.
How does it work?
For image classification and segmentation
● padding
● random rotating
● re-scaling,
● vertical and horizontal flipping
● translation ( image is moved along X, Y direction)
● cropping
● zooming
● darkening & brightening/color modification
● grayscaling
● changing contrast
● adding noise
● random erasing
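As a hedged sketch, several of the operations listed above can be composed with torchvision (assuming PyTorch/torchvision is available; all parameter values are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # random rotating
    transforms.RandomResizedCrop(224),                     # re-scaling, cropping, zooming
    transforms.RandomHorizontalFlip(),                     # horizontal flipping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # brightening/contrast change
    transforms.RandomGrayscale(p=0.1),                     # grayscaling
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                      # random erasing (on tensors)
])
```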
Advanced models for data augmentation

● Generative adversarial networks (GANs): GAN algorithms can learn patterns from input datasets
and automatically create new examples which resemble training data.
● Neural style transfer: Neural style transfer models can blend content image and style image and
separate style from content.
● Reinforcement learning: Reinforcement learning models train software agents to attain their
goals and make decisions in a virtual environment.
What are the benefits of data augmentation?

● Improving model prediction accuracy


○ adding more training data into the models
○ preventing data scarcity for better models
○ reducing data overfitting and creating variability in data
○ increasing generalization ability of the models
● Reducing costs of collecting and labeling data
● Enables rare event prediction
● Prevents data privacy problems
Early Stopping

● A significant challenge when training a machine learning model is deciding how many epochs to
run. Too few epochs might not lead to model convergence, while too many epochs could lead to
overfitting.
● Early stopping is an optimization technique used to reduce overfitting without compromising on
model accuracy. The main idea behind early stopping is to stop training before a model starts to
overfit.
Early Stopping Approaches

1. Training model on a preset number of epochs


2. Stop when the loss function update becomes small
3. Validation set strategy
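A minimal sketch of the validation-set strategy (all names and the toy loss curve are illustrative):

```python
import numpy as np

def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_step()             # one epoch of training
        val_loss = validate()    # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch  # new best: keep training
        elif epoch - best_epoch >= patience:
            print(f"stopping early at epoch {epoch}")
            break
    return best_loss

# Toy demo: a validation curve that improves, then worsens (overfitting)
losses = iter([1.0, 0.8, 0.6, 0.55, 0.56, 0.58, 0.60, 0.62, 0.64, 0.70])
print(train_with_early_stopping(lambda: None, lambda: next(losses),
                                max_epochs=10, patience=3))
```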
Deep Learning

Cross Validation and Regularization: L2 and L1 Regularization

By

Dr. Kumud Tripathi


Cross-Validation Technique
● Validation
○ In this method, we perform training on 50% of the given dataset and the remaining 50% is used for testing/validation.
○ The major drawback of this method is that since we train on only 50% of the dataset, the remaining 50% may contain important information that we miss while training our model, i.e., higher bias.
Cross-Validation Technique
● LOOCV (Leave One Out Cross Validation)
○ In this method, we train on the whole dataset but leave out a single data point, and iterate this for each data point.
○ An advantage of this method is that we make use of all data points, hence low bias.
○ The disadvantage of this technique is that it can be computationally expensive.
Cross-Validation Technique
● K-Fold Cross Validation
○ This approach divides the input dataset into K groups of samples of equal size (folds).
○ For each learning set, the prediction function uses K−1 folds, and the remaining fold is used as the test set.
Cross-Validation Technique
Advantages of train/test split:
1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
2. Simpler to examine the detailed results of the testing process.

Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More "efficient" use of data, as every observation is used for both training and testing.
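A short sketch of the K-fold split, assuming scikit-learn is available (the toy data are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # K-1 folds train, the remaining fold tests; each sample tests exactly once
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```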
Model Overfitting
L1 Regularization
● Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function:

Loss = Error(y, ŷ) + λ Σ |w_i|

● Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether.
● So, this works well for feature selection when we have a huge number of features.
● L1 regularization is robust in dealing with outliers.
L2 Regularization
● The regression model that uses L2 regularization is called Ridge Regression.
● Regularization adds a penalty as model complexity increases.
● Ridge regression adds the "squared magnitude of the coefficients" as a penalty term to the loss function:

Loss = Error(y, ŷ) + λ Σ w_i²

● In L1 and L2 regularization, lambda is chosen using the cross-validation technique.
L2 Regularization
● Ridge regularization forces the weights to be small but does not make them zero; it does not give a sparse solution.
● Ridge is not robust to outliers.
● Ridge regression performs better when all the input features influence the output and all weights are of roughly equal size.
● L2 regularization can learn complex data patterns.
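A minimal sketch of the two penalty terms added to a squared-error data loss (the toy values are illustrative):

```python
import numpy as np

def ridge_loss(y, y_hat, w, lam):
    # L2 penalty: squared magnitude of the coefficients
    return np.mean((y - y_hat) ** 2) + lam * np.sum(w ** 2)

def lasso_loss(y, y_hat, w, lam):
    # L1 penalty: absolute magnitude; can drive weights to exactly zero
    return np.mean((y - y_hat) ** 2) + lam * np.sum(np.abs(w))

w = np.array([0.5, -1.2, 0.0, 3.0])
y, y_hat = np.array([1.0, 2.0]), np.array([1.1, 1.8])
print(ridge_loss(y, y_hat, w, lam=0.1), lasso_loss(y, y_hat, w, lam=0.1))
```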
Deep Learning

Regularization: Ensembling and Dropout

By

Dr. Kumud Tripathi


Ensemble Method
● Ensemble learning helps improve machine learning results by combining several models.
● This approach allows the production of better predictive performance compared to a
single model.
● Basic idea is to learn a set of classifiers and to allow them to vote.
How to Ensemble Neural Network Models
It can be helpful to think of varying each of the three major elements of the ensemble method;
for example:

● Training Data: Vary the choice of data used to train each model in the ensemble.
● Ensemble Models: Vary the choice of the models used in the ensemble.
● Combinations: Vary the choice of the way that outcomes from ensemble members are
combined.
Bagging Approach
● It is a type of ensemble method.
● This approach is called bootstrap aggregation, and was designed for use with decision
trees that have high variance and low bias.
● Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples, selecting
observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.
Advantages of Ensemble Method

1. Ensemble methods have higher predictive accuracy compared to individual models.
2. Ensemble methods are very useful when there are both linear and non-linear types of data in the dataset; different models can be combined to handle each type.
3. With ensemble methods, bias/variance can be reduced, and most of the time the model is neither underfitted nor overfitted.
4. An ensemble of models is usually less noisy and more stable.
Disadvantages of Ensemble Method

1. Ensembling is less interpretable; the output of the ensembled model is hard to predict and explain.
2. The art of ensembling is hard to learn, and any wrong selection can lead to lower predictive accuracy than an individual model.
3. Ensembling is expensive in terms of both time and space.
Dropout Technique
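The dropout slides here are figures; as a hedged sketch, the widely used inverted-dropout formulation looks like this (the rescaling by 1/(1−p) is a standard convention, not taken from the slides):

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Zero each activation with probability p during training and
    rescale the survivors by 1/(1-p); do nothing at test time."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.ones((2, 8))
print(dropout(h, p=0.5))           # about half the units zeroed, rest scaled by 2
print(dropout(h, training=False))  # unchanged at inference
```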
Deep Learning

Optimization Algorithms

By

Dr. Kumud Tripathi


Optimization Algorithm
● An optimization algorithm finds the value of the parameters (weights) that
minimize the error when mapping inputs to outputs.
● These optimization algorithms or optimizers widely affect the accuracy of the
deep learning model.
● An optimizer is a function or an algorithm that modifies the attributes of the neural
network, such as weights and learning rate.
Terminologies in Deep Learning
● Epoch: An epoch is defined as a single training pass of all batches through both forward and back propagation. This means 1 epoch is a single forward and backward pass of the entire input data.
● Batch: Denotes the number of samples taken together for updating the model parameters.
● Learning rate: A parameter that tells the model the scale at which the weights should be updated.
GD Optimization Algorithm
● The main objective of using a gradient descent algorithm is to minimize the cost function iteratively.
● It calculates the first-order derivative of the function to compute the gradient or slope of that function.
● It then moves in the direction opposite to the gradient, stepping from the current point by alpha times the gradient, where alpha is the learning rate.
● The learning rate is a tuning parameter in the optimization process that helps decide the length of the steps.
GD Optimization Algorithm

● Based on how much of the training data is used to compute the error for each update, the gradient descent learning algorithm can be divided into:
a. Batch gradient descent,
b. Stochastic gradient descent, and
c. Mini-batch gradient descent.
Batch GD Optimization Algorithm

● This is a type of gradient descent which processes all the training examples for each
iteration of gradient descent.
● But if the number of training examples is large, then batch gradient descent is
computationally very expensive.
● Hence if the number of training examples is large, then batch gradient descent is
not preferred. Instead, we prefer to use stochastic gradient descent or mini-batch
gradient descent.
Stochastic GD Optimization Algorithm

● This is a type of gradient descent which processes 1 training example per iteration.
● Hence, the parameters are being updated even after one iteration in which only a
single example has been processed.
● Hence, this is quite faster than batch gradient descent.
● But again, when the number of training examples is large, even then it processes
only one example which can be additional overhead for the system as the number of
iterations will be quite large.
Mini-Batch GD Optimization Algorithm

● This is a type of gradient descent that works faster than both batch gradient descent and stochastic gradient descent.
● Here b examples, where b < m, are processed per iteration.
● So even if the number of training examples is large, they are processed in batches of b training examples in one go.
● Thus, it works for larger training sets, and with a smaller number of iterations.
Convergence trends in different variants of GD:

● In the case of Batch GD, the algorithm follows a straight path towards the global minimum. Here the learning rate is typically held constant.
● In the case of stochastic GD and mini-batch GD, the algorithm does not converge exactly but keeps fluctuating around the global minimum.
● Therefore, in order to make it converge, we have to slowly decrease the learning rate.
● The convergence of stochastic gradient descent is much noisier, since each iteration processes only one training example.
GD with Momentum

● An adaptive optimization algorithm uses exponentially weighted averages of gradients over previous iterations to stabilize the convergence, resulting in quicker optimization.
● For example, in most real-world applications of deep neural networks, training is carried out on noisy data. It is therefore necessary to reduce the effect of noise when the data are fed in batches during optimization.
● This problem can be tackled using Exponentially Weighted Averages (or Exponentially Weighted Moving Averages).
GD with Momentum
● Here, 'w' and 'b' are updated based not just on the current update (derivative), but also on the past updates (derivatives):

v_t = γ·v_{t−1} + η·∇w_t;   w = w − v_t

● In the equation, 'γ·v_{t−1}' represents the history component.
● The 'gamma' value ranges from 0 to 1.
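A minimal sketch of this update on a toy quadratic (the learning rate and gamma are illustrative):

```python
import numpy as np

def sgd_momentum(grad_f, w0, lr=0.01, gamma=0.9, steps=100):
    """v_t = gamma * v_{t-1} + lr * grad(w);  w = w - v_t."""
    w, v = w0, np.zeros_like(w0)
    for _ in range(steps):
        v = gamma * v + lr * grad_f(w)  # history term plus current gradient
        w = w - v
    return w

# Minimize f(w) = w1^2 + 10*w2^2, an elongated bowl where momentum helps
grad = lambda w: np.array([2 * w[0], 20 * w[1]])
print(sgd_momentum(grad, np.array([5.0, 5.0])))  # approaches [0, 0]
```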
Deep Learning

Introduction to Convolution Neural Network, Convolution


Operation

By

Dr. Kumud Tripathi


The convolution operation

Examples of 2D convolutions applied to images

Working example of 2D convolution
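The convolution slides here are figures; as a hedged sketch, the basic 2D operation (cross-correlation, which is what deep-learning libraries call convolution) can be written as:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image; each output is a weighted sum."""
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # sum over the receptive field
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
print(conv2d(image, kernel))  # 3x3 output: (5 - 3)/1 + 1 = 3
```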
Deep Learning

Input Size, Output Size, Filter Size, Padding and Stride

By

Dr. Kumud Tripathi


Relation between input size, output size and filter size

For an input of width W1, height H1 and depth D1, convolved with K filters of size F×F, stride S and zero-padding P, the output dimensions are:

W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K

Padding: zero-padding adds a border of P zeros around the input so that border pixels are covered and the output size can be preserved.

Stride: the stride S is the interval at which the filter is applied; larger strides produce smaller outputs.

Depth of the output: the depth of the output equals the number of filters K.
Let us do a few exercises:

H2 = ?

W2 = ?

D2 = ?
Let us do a few exercises:

H2 = 55

W2 = 55

D2 = 96
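A small helper encoding the relations W2 = (W1 − F + 2P)/S + 1, H2 = (H1 − F + 2P)/S + 1, D2 = K. The concrete dimensions below (227×227×3 input, 96 filters of size 11×11, stride 4, no padding) are an assumption chosen to match the answers above:

```python
def conv_output_size(W1, H1, K, F, S, P):
    """Output width/height/depth of a conv layer with K FxF filters."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K  # output depth equals the number of filters

print(conv_output_size(227, 227, K=96, F=11, S=4, P=0))  # (55, 55, 96)
```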
Let us do a few exercises:

H2 = ?

W2 = ?

D2 = ?
Deep Learning

ML Vs DL, CNN, FFNN Vs CNN, and It’s Characteristics

By

Dr. Kumud Tripathi


Machine learning Vs Deep Learning

● Instead of using handcrafted kernels (such as edge detectors), can we learn meaningful kernels/filters in addition to learning the weights of the classifier?
CNN

● Even better: instead of using handcrafted kernels (such as edge detectors), can we learn multiple meaningful kernels/filters in addition to learning the weights of the classifier?
CNN

● Can we learn multiple layers of meaningful kernels/filters in addition to learning the weights of the classifier?
CNN
CNN: MNIST Dataset
● The MNIST database (Modified National Institute of Standards and
Technology database) is a large database of handwritten digits that is
commonly used for training various image processing systems.
● The images are black and white, of size 28×28 pixels.
● The MNIST database contains 60,000 training images and 10,000 testing
images.
FFNN Vs CNN

CNN Characteristics: Sparse Connectivity

CNN Characteristics: Weight Sharing
Deep Learning

CNN, Pooling, and Case Study

By

Dr. Kumud Tripathi


CNN

CNN: Pooling
● Instead of max pooling we can also do average pooling.

CNN
(Worked-example slides with per-layer computations; values extracted from the figures: 150, 0, 2400, 0, 48120, 10164, 26, 2210.)

● We can thus train a convolutional neural network using backpropagation by thinking of it as a feedforward neural network with sparse connections.
Deep Learning

CNN on ImageNet: AlexNet, ZFNet, VGGNet

By

Dr. Kumud Tripathi


CNN

Success stories of CNN on ImageNet


ImageNet

● The ImageNet dataset is a very large collection of human-annotated photographs designed by academics for developing computer vision algorithms.
● The ImageNet Large Scale Visual Recognition Challenge, or ILSVRC, is an annual competition that uses subsets of the ImageNet dataset and is designed to foster the development and benchmarking of state-of-the-art algorithms.
● ImageNet has 14 million images and more than 21 thousand groups or classes.
ImageNet
CNN on ImageNet
AlexNet (Alex Krizhevsky Net)
ZFNet (Zeiler & Fergus Net)
VGGNet
Deep Learning

Transfer Learning in CNN

By

Dr. Kumud Tripathi


Transfer Learning

The reuse of a pre-trained model on a new problem is known as transfer learning in machine learning.

In transfer learning, a machine uses the knowledge learned from a prior task to improve prediction on a new task.

The knowledge of an already trained machine learning model is transferred to a different but closely related problem.

For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge the model gained during training to recognize other objects, such as sunglasses.
How Transfer Learning Works?

In computer vision, neural networks typically detect edges in the first layers, shapes in the middle layers, and task-specific features in the later layers.

In transfer learning, the early and central layers are reused, and only the later layers are retrained.

It makes use of the labelled data from the task it was originally trained on.

Take the example of a model intended to identify a backpack in an image that will now be used to detect sunglasses. Because the earlier layers have already learned to recognise generic objects, we simply retrain the later layers to learn what distinguishes sunglasses from other objects.
Uses of Transfer Learning

Transfer learning offers a number of advantages, the most important of which are:

1) reduced training time,

2) improved neural network performance, and

3) not needing a large amount of data.

When to Use Transfer Learning

1) When we don’t have enough annotated data to train our model with.

2) When there is a pre-trained model that has been trained on similar data and tasks.
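A hedged PyTorch/torchvision sketch of this recipe (assumes torchvision ≥ 0.13; the two-class head is illustrative):

```python
import torch.nn as nn
import torchvision.models as models

# Load a network pre-trained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the early and central layers so their learned features are kept
for param in model.parameters():
    param.requires_grad = False

# Replace and retrain only the final layer for the new task
num_classes = 2  # e.g., "sunglasses" vs. "no sunglasses"
model.fc = nn.Linear(model.fc.in_features, num_classes)  # trainable by default
print(model.fc)
```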
Deep Learning

Introduction to sequential learning

By

Dr. Kumud Tripathi


Sequential Learning

Transfer Learning
Deep Learning

Recurrent Neural Network: RNN

By

Dr. Kumud Tripathi


Recurrent Neural Network
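The RNN slides here are figures; as a hedged sketch, one step of a vanilla recurrent cell can be written as follows (all dimensions and the random parameters are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    """One recurrent step: the new state mixes the current input with the
    previous hidden state, which carries information across time."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    y_t = Why @ h_t + by
    return h_t, y_t

# Tiny dimensions for illustration: 3-dim input, 4-dim state, 2-dim output
rng = np.random.default_rng(0)
Wxh, Whh, Why = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
bh, by = np.zeros(4), np.zeros(2)

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):  # a sequence of 5 inputs
    h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)
print(h, y)
```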
Deep Learning

Backpropagation through time: BPTT

By

Dr. Kumud Tripathi


Backpropagation through time
Deep Learning

Long Short Term Memory Cells (LSTMs) and Gated


Recurrent Unit (GRU)

By

Dr. Kumud Tripathi


Long Short Term Memory Cells (LSTMs)

Selective Read, Selective Write, Selective Forget - The Whiteboard Analogy
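The LSTM slides here are figures; the sketch below is a hedged, minimal single LSTM step matching the selective write/forget/read picture (the stacked gate-parameter layout is an illustrative choice):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b each hold the parameters of the four gates
    (input i, forget f, output o, candidate g) stacked on the first axis."""
    i = sigmoid(W[0] @ x + U[0] @ h_prev + b[0])  # selective write
    f = sigmoid(W[1] @ x + U[1] @ h_prev + b[1])  # selective forget
    o = sigmoid(W[2] @ x + U[2] @ h_prev + b[2])  # selective read
    g = np.tanh(W[3] @ x + U[3] @ h_prev + b[3])  # new candidate content
    c = f * c_prev + i * g                        # update the cell "whiteboard"
    h = o * np.tanh(c)
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4, n_hid, n_in))
U = rng.normal(size=(4, n_hid, n_hid))
b = np.zeros((4, n_hid))
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h, c)
```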
Gated Recurrent Unit (GRU)
Deep Learning

Introduction to Autoencoders

By

Dr. Kumud Tripathi

Mitesh M. Khapra, CS7015 (Deep Learning): Lecture 7


An autoencoder is a special type of feedforward neural network which does the following:
● Encodes its input xi into a hidden representation h:

h = g(W xi + b)

● Decodes the input again from this hidden representation:

x̂i = f(W* h + c)

● The model is trained to minimize a loss function which ensures that x̂i is close to xi (we will see some such loss functions soon).


Let us consider the case where dim(h) < dim(xi).
If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?
h is a loss-free encoding of xi. It captures all the important characteristics of xi.

An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder.


Let us consider the case when dim(h) ≥ dim(xi).
In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i.
Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.

An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder.


The Road Ahead
● Choice of f(xi) and g(xi)
● Choice of loss function


Suppose all our inputs are binary (each xij ∈ {0, 1}). Which of the following functions would be most apt for the decoder?

● x̂i = tanh(W* h + c)
● x̂i = W* h + c
● x̂i = logistic(W* h + c)

Logistic, as it naturally restricts all outputs to be between 0 and 1.
g is typically chosen as the sigmoid function.


Suppose all our inputs are real-valued (each xij ∈ R). Which of the following functions would be most apt for the decoder?

● x̂i = tanh(W* h + c)
● x̂i = W* h + c
● x̂i = logistic(W* h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [-1, 1], whereas we want x̂i ∈ Rⁿ. So the linear decoder x̂i = W* h + c is most apt.
Again, g is typically chosen as the sigmoid function.


The Road Ahead
● Choice of loss function


Consider the case when the inputs are real-valued.
The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible.
This can be formalized using the following objective function:

min_{W,W*,c,b} (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²

i.e., min_{W,W*,c,b} (1/m) Σ_{i=1..m} (x̂i − xi)ᵀ(x̂i − xi)

We can then train the autoencoder just like a regular feedforward network using backpropagation.
All we need is a formula for ∂L(θ)/∂W* and ∂L(θ)/∂W, which we will see now.
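Before deriving the gradients, here is a hedged PyTorch sketch of training such an autoencoder with the squared-error objective (architecture sizes and optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

# A minimal undercomplete autoencoder sketch (dim(h) < dim(x))
n, d = 784, 32
model = nn.Sequential(
    nn.Linear(n, d), nn.Sigmoid(),  # encoder: h = g(W x + b)
    nn.Linear(d, n),                # linear decoder for real-valued inputs
)
loss_fn = nn.MSELoss()              # squared-error reconstruction objective
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(16, n)               # a toy batch of real-valued inputs
for _ in range(100):
    x_hat = model(x)
    loss = loss_fn(x_hat, x)        # reconstruct x from itself
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```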


L(θ) = (x̂i − xi)ᵀ(x̂i − xi), where the network has h₀ = xi, pre-activations a₁, a₂, activations h₁, and output h₂ = x̂i.

∂L(θ)/∂W* = (∂L(θ)/∂h₂)(∂h₂/∂a₂)(∂a₂/∂W*)

∂L(θ)/∂W = (∂L(θ)/∂h₂)(∂h₂/∂a₂)(∂a₂/∂h₁)(∂h₁/∂a₁)(∂a₁/∂W)

Note that the loss function is shown for only one training example.
We have already seen how to calculate these chain-rule factors when we learnt backpropagation. The first factor is:

∂L(θ)/∂h₂ = ∂L(θ)/∂x̂i = ∇_{x̂i}{(x̂i − xi)ᵀ(x̂i − xi)} = 2(x̂i − xi)


Consider the case when the inputs are binary.
We use a sigmoid decoder, which produces outputs between 0 and 1 that can be interpreted as probabilities.
For a single n-dimensional i-th input we can use the following loss function:

min { − Σ_{j=1..n} (xij log x̂ij + (1 − xij) log(1 − x̂ij)) }

What value of x̂ij will minimize this function?
● If xij = 1?
● If xij = 0?
Indeed, the above function will be minimized when x̂ij = xij!
Again, all we need is a formula for ∂L(θ)/∂W* and ∂L(θ)/∂W to use backpropagation.
L(θ) = − Σ_{j=1..n} (xij log x̂ij + (1 − xij) log(1 − x̂ij))

∂L(θ)/∂W* = (∂L(θ)/∂h₂)(∂h₂/∂a₂)(∂a₂/∂W*)

∂L(θ)/∂W = (∂L(θ)/∂h₂)(∂h₂/∂a₂)(∂a₂/∂h₁)(∂h₁/∂a₁)(∂a₁/∂W)

We have already seen how to calculate these factors when we learnt backpropagation. The first two terms on the RHS can be computed as:

∂L(θ)/∂h₂ⱼ = − xij/x̂ij + (1 − xij)/(1 − x̂ij)

∂h₂ⱼ/∂a₂ⱼ = σ(a₂ⱼ)(1 − σ(a₂ⱼ))

and ∂L(θ)/∂h₂ is the vector [∂L(θ)/∂h₂₁, ..., ∂L(θ)/∂h₂ₙ]ᵀ.
Deep Learning
Regularization in autoencoders, Denoising
autoencoders, Sparse autoencoders, Contractive autoencoders

By

Dr. Kumud Tripathi



Regularization in autoencoders



While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders.
Here (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i.
To avoid poor generalization, we need to introduce regularization.


The simplest solution is to add an L2-regularization term to the objective function:

min_{θ,W,W*,b,c} (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)² + λ‖θ‖²

This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters).


Denoising Autoencoders



A denoising autoencoder simply corrupts the input data using a probabilistic process P(x̃ij | xij) before feeding it to the network.
A simple P(x̃ij | xij) used in practice is the following:

P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q


How does this help?
This helps because the objective is still to reconstruct the original (uncorrupted) xi:

arg min_θ (1/m) Σ_{i=1..m} Σ_{j=1..n} (x̂ij − xij)²

It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so).
Instead, the model will now have to capture the characteristics of the data correctly.
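A minimal sketch of this masking-noise corruption (the input vector is illustrative):

```python
import numpy as np

def corrupt(x, q=0.25, rng=np.random.default_rng(0)):
    """P(x_tilde = 0 | x) = q;  P(x_tilde = x | x) = 1 - q."""
    mask = rng.random(x.shape) >= q  # keep each entry with probability 1 - q
    return x * mask

x = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.6])
x_tilde = corrupt(x, q=0.25)
# The autoencoder sees x_tilde as input but is trained to reconstruct x
print(x, x_tilde)
```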


We will now see a practical application in which AEs are used and then compare
Denoising Autoencoders with regular autoencoders



Task: Hand-written digit recognition (MNIST data; |xi| = 784 = 28 × 28).

Figure: Basic approach (we use the raw data as input features).

Figure: AE approach (first learn the important characteristics of the data as h ∈ Rᵈ, and then train a classifier on top of this hidden representation).


We will now see a way of visualizing AEs and use this visualization to compare
different AEs



We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration $x_i$. For example,

$$h_1 = \sigma(W_1^T x_i) \quad \text{[ignoring bias } b\text{]}$$

where $W_1$ is the trained vector of weights connecting the input to the first hidden neuron.

What values of $x_i$ will cause $h_1$ to be maximum (or maximally activated)? Suppose we assume that our inputs are normalized so that $\|x_i\| = 1$. Then we want

$$\max_{x_i} \ \{W_1^T x_i\} \quad \text{s.t.} \ \|x_i\|^2 = x_i^T x_i = 1$$

$$\text{Solution: } x_i = \frac{W_1}{\sqrt{W_1^T W_1}}$$



Thus the inputs

$$x_i = \frac{W_1}{\sqrt{W_1^T W_1}}, \frac{W_2}{\sqrt{W_2^T W_2}}, \ldots, \frac{W_n}{\sqrt{W_n^T W_n}}$$

will respectively cause hidden neurons $1$ to $n$ to fire maximally.

Let us plot these images ($x_i$'s) which maximally activate the first $k$ neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders. These $x_i$'s are computed by the above formula using the weights $(W_1, W_2, \ldots, W_k)$ learned by the respective autoencoders, as in the sketch below.
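A small sketch of this computation (NumPy; the random matrix stands in for the weights a trained autoencoder would provide):

```python
import numpy as np

# W: encoder weight matrix with one column per hidden neuron; shape (784, k).
# A random matrix stands in here for the weights learned by an autoencoder.
W = np.random.randn(784, 16)

# Input that maximally activates neuron l subject to ||x_i|| = 1:
#   x_i = W_l / sqrt(W_l^T W_l)
filters = W / np.sqrt((W ** 2).sum(axis=0, keepdims=True))

# One 28x28 "filter image" per hidden neuron, ready to plot.
images = filters.T.reshape(-1, 28, 28)
```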



Figure: Vanilla AE (no noise) | 25% Denoising AE (q = 0.25) | 50% Denoising AE (q = 0.5)

The vanilla AE does not learn many meaningful patterns. The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0', '2', '3', '8', or '9'). As the noise increases, the filters become wider because each neuron has to rely on more adjacent pixels to feel confident about a stroke.
We saw one form of $P(\tilde{x}_{ij} \mid x_{ij})$ which flips a fraction $q$ of the inputs to zero. Another way of corrupting the inputs is to add Gaussian noise to the input:

$$\tilde{x}_{ij} = x_{ij} + \mathcal{N}(0, 1)$$

We will now use such a denoising AE on a different dataset and see its performance.



Figure: Data | AE filters | Weight decay filters

The hidden neurons essentially behave like edge detectors.



Sparse Autoencoders



A hidden neuron with sigmoid activation will have values between 0 and 1. We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0. A sparse autoencoder tries to ensure that each neuron is inactive most of the time.



The average value of the activation of a neuron $l$ is given by

$$\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l$$

If the neuron $l$ is sparse (i.e., mostly inactive) then $\hat{\rho}_l \to 0$. A sparse autoencoder uses a sparsity parameter $\rho$ (typically very close to 0, say 0.005) and tries to enforce the constraint $\hat{\rho}_l = \rho$. One way of ensuring this is to add the following term to the objective function (see the sketch below):

$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$$

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
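A minimal sketch of this penalty (PyTorch; the clamp guards against log(0) and is an implementation detail, not part of the formula):

```python
import torch

def sparsity_penalty(h, rho=0.005):
    """Omega(theta) = sum_l [ rho*log(rho/rho_hat_l)
                            + (1-rho)*log((1-rho)/(1-rho_hat_l)) ],
    where rho_hat_l is the mean activation of hidden neuron l."""
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # guard against log(0)
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

# h: (batch, k) sigmoid activations of the hidden layer
# loss = reconstruction_loss + sparsity_penalty(h, rho=0.005)
```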



Figure: $\Omega(\theta)$ plotted as a function of $\hat{\rho}_l$ for $\rho = 0.2$.

The function reaches its minimum value (of 0) when $\hat{\rho}_l = \rho$.



$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$$

can be re-written as

$$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \rho - \rho \log \hat{\rho}_l + (1 - \rho) \log (1 - \rho) - (1 - \rho) \log (1 - \hat{\rho}_l)$$

Now, $\hat{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \Omega(\theta)$, where $\mathcal{L}(\theta)$ is the squared error loss or cross-entropy loss and $\Omega(\theta)$ is the sparsity constraint. We already know how to calculate $\frac{\partial \mathcal{L}(\theta)}{\partial W}$; let us see how to calculate $\frac{\partial \Omega(\theta)}{\partial W}$.

By the chain rule:

$$\frac{\partial \Omega(\theta)}{\partial W} = \frac{\partial \Omega(\theta)}{\partial \hat{\rho}} \cdot \frac{\partial \hat{\rho}}{\partial W}$$

where

$$\frac{\partial \Omega(\theta)}{\partial \hat{\rho}} = \left[ \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_1}, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_2}, \ldots, \frac{\partial \Omega(\theta)}{\partial \hat{\rho}_k} \right]^T$$

For each neuron $l \in 1 \ldots k$ in the hidden layer, we have

$$\frac{\partial \Omega(\theta)}{\partial \hat{\rho}_l} = -\frac{\rho}{\hat{\rho}_l} + \frac{1 - \rho}{1 - \hat{\rho}_l}$$

and $\frac{\partial \hat{\rho}_l}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T$ (see the derivation below). Finally,

$$\frac{\partial \hat{\mathcal{L}}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial W} + \frac{\partial \Omega(\theta)}{\partial W}$$

(and we know how to calculate both terms on the R.H.S.)



Derivation

$$\frac{\partial \hat{\rho}}{\partial W} = \left[ \frac{\partial \hat{\rho}_1}{\partial W}, \frac{\partial \hat{\rho}_2}{\partial W}, \ldots, \frac{\partial \hat{\rho}_k}{\partial W} \right]$$

For each element in the above equation we can calculate $\frac{\partial \hat{\rho}_l}{\partial W}$ (the partial derivative of a scalar w.r.t. a matrix, which is a matrix). For a single element $W_{jl}$ of the matrix $W$:

$$\frac{\partial \hat{\rho}_l}{\partial W_{jl}} = \frac{\partial \left[ \frac{1}{m} \sum_{i=1}^{m} g\left( W_{:,l}^T x_i + b_l \right) \right]}{\partial W_{jl}} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \, g\left( W_{:,l}^T x_i + b_l \right)}{\partial W_{jl}} = \frac{1}{m} \sum_{i=1}^{m} g'\left( W_{:,l}^T x_i + b_l \right) x_{ij}$$

So in matrix notation we can write it as:

$$\frac{\partial \hat{\rho}_l}{\partial W} = x_i \left( g'(W^T x_i + b) \right)^T$$
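With automatic differentiation none of this needs to be hand-coded. As a sanity check, here is a hedged sketch comparing the closed form above against autograd for a sigmoid encoder with a single example (so $\hat{\rho} = h$; sizes are illustrative):

```python
import torch

n, k = 8, 4                          # illustrative input / hidden sizes
W = torch.randn(n, k, requires_grad=True)
b = torch.randn(k)
x = torch.rand(n)

h = torch.sigmoid(W.t() @ x + b)     # with m = 1, rho_hat = h
l = 2                                # pick one hidden neuron
h[l].backward()                      # autograd computes d(rho_hat_l)/dW

# Closed form: d(rho_hat_l)/dW = x (g'(W^T x + b))^T, nonzero in column l only
g_prime = (h * (1 - h)).detach()     # sigmoid' = sigmoid * (1 - sigmoid)
expected = torch.zeros(n, k)
expected[:, l] = x * g_prime[l]
print(torch.allclose(W.grad, expected, atol=1e-6))  # True
```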



Deep Learning
Regularization in autoencoders,
Sparse autoencoders, Contractive autoencoders

By

Dr. Kumud Tripathi





Contractive Autoencoders



A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function. It does so by adding the following regularization term to the loss function:

$$\Omega(\theta) = \|J_x(h)\|_F^2$$

where $J_x(h)$ is the Jacobian of the encoder. Let us see what it looks like.



If the input has $n$ dimensions and the hidden layer has $k$ dimensions then

$$J_x(h) = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_k}{\partial x_1} & \cdots & \frac{\partial h_k}{\partial x_n} \end{bmatrix}$$

In other words, the $(l, j)$ entry of the Jacobian captures the variation in the output of the $l$-th neuron with a small variation in the $j$-th input.

$$\|J_x(h)\|_F^2 = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$$
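For a sigmoid encoder $h = \sigma(Wx + b)$ the Jacobian has the closed form $\frac{\partial h_l}{\partial x_j} = h_l(1 - h_l)W_{lj}$, so the penalty factorizes. A minimal sketch under that assumption (the weighting `lam` in the comment is an illustrative hyperparameter):

```python
import torch

def contractive_penalty(h, W):
    """||J_x(h)||_F^2 for a sigmoid encoder h = sigmoid(W x + b), using
    dh_l/dx_j = h_l (1 - h_l) W_lj, so the Frobenius norm factorizes.
    h: (batch, k) hidden activations; W: (k, n) encoder weight matrix."""
    dh_sq = (h * (1 - h)) ** 2              # (batch, k): (h_l (1 - h_l))^2
    w_sq = (W ** 2).sum(dim=1)              # (k,): sum_j W_lj^2
    return (dh_sq * w_sq).sum(dim=1).mean() # averaged over the batch

# loss = reconstruction_loss + lam * contractive_penalty(h, W)
```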



What is the intuition behind this? Consider $\frac{\partial h_1}{\partial x_1}$: what does it mean if $\frac{\partial h_1}{\partial x_1} = 0$? It means that this neuron is not very sensitive to variations in the input $x_1$. But doesn't this contradict our other goal of minimizing $\mathcal{L}(\theta)$, which requires $h$ to capture variations in the input?



Indeed it does, and that's the idea. By pitting these two contradicting objectives against each other, we ensure that $h$ is sensitive to only the very important variations observed in the training data.

$\mathcal{L}(\theta)$ - capture important variations in the data
$\Omega(\theta)$ - do not capture variations in the data
Tradeoff - capture only the very important variations in the data



Let us try to understand this with the help of an illustration.



Consider the variations in the data along directions $u_1$ and $u_2$ (see figure). It makes sense to encourage a neuron to be sensitive to variations along $u_1$. At the same time, it makes sense to inhibit a neuron from being sensitive to variations along $u_2$ (which appears to be small noise, unimportant for reconstruction). By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity. What does this remind you of?



Summary



Regularization:

$\Omega(\theta) = \lambda \|\theta\|^2$ (weight decay)

$\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}$ (sparse)

$\Omega(\theta) = \sum_{j=1}^{n} \sum_{l=1}^{k} \left( \frac{\partial h_l}{\partial x_j} \right)^2$ (contractive)



Deep Learning

Variational Autoencoder

By

Dr. Kumud Tripathi


Variational Autoencoder
Autoencoder
Latent Spaces
Variational Autoencoder
VAE Architecture
VAE Loss Function
Training VAE
Loss Layer
Summary
AE Vs VAE
AE Vs VAE
Autoencoder (AE)

· Used to generate a compressed transformation of the input in a latent space

· The latent variable is not regularized

· Picking a random latent variable will generate garbage output

· The latent variable takes deterministic values

· The latent space lacks generative capability


AE Vs VAE
Variational Autoencoder (VAE)

· Enforces the latent variable to follow a unit Gaussian distribution

· The latent variable in the compressed form consists of a mean and a variance

· A random value of the latent variable generates meaningful output at the decoder

· The input of the decoder is stochastic and is sampled from a Gaussian with the mean and variance output by the encoder (see the sketch below)

· Regularized latent space

· The latent space has generative capabilities
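The stochastic decoder input is usually implemented with the reparameterization trick; a minimal sketch (the `encoder`/`decoder` names are illustrative placeholders):

```python
import torch

def sample_latent(mu, log_var):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so the
    decoder input is a sample from the Gaussian predicted by the encoder."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# mu, log_var = encoder(x)   # encoder outputs mean and log-variance
# x_hat = decoder(sample_latent(mu, log_var))
```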


Generative Adversarial Networks
(GANs)
By-
Dr. Kumud Tripathi
Outline

What are GANs?

Intuition behind GANs?

GANs Architecture

Training Procedure

Why GANs?

What are GANs?
Generative

Learn a generative model

Adversarial

Trained in an adversarial setting

Networks

Use Deep Neural Networks


What are GANs?

Figure 1. The generator tries to generate fake images while taking random noise as input, and the discriminator tries to classify them as real or fake.

https://learnopencv.com/introduction-to-generative-adversarial-networks/#generator
Intuition behind GANs?

Figure 2. The counterfeiter trying to generate fake money using a feedback mechanism from the police.

https://learnopencv.com/introduction-to-generative-adversarial-networks/#generator
GANs Architecture

Figure 3. Block diagram of GANs. (Z is some random noise (Gaussian/Uniform); Z can be thought of as the latent representation of the image.)

https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
Training Discriminator

Figure 4. Block diagram of discriminator training.

https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
Training Generator

Figure 5. Block diagram of generator training.

https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
GAN's Formulation

$$\min_G \max_D V(D, G)$$

It is formulated as a minimax game, where:

· The discriminator is trying to maximize its reward $V(D, G)$
· The generator is trying to minimize the discriminator's reward (or maximize its loss)

$$V(D, G) = \mathbb{E}_{x \sim p(x)}[\log D(x)] + \mathbb{E}_{z \sim q(z)}[\log(1 - D(G(z)))]$$

The Nash equilibrium of this particular game is achieved at:

$$p_{data}(x) = p_{gen}(x) \quad \forall x$$
Why GANs?

Image-to-Image Translation

Text-to-Image Synthesis

Data Augmentation

Low-Resolution to High-Resolution

Synthetic Speech Generation

Voice Translation

Speech Style Translation
Deep Learning
Generative Adversarial Networks (GANs)

Dr. Kumud Tripathi

As usual we are given some training data (say, MNIST images) which obviously comes from some underlying distribution. Our goal is to generate more images from this distribution (i.e., create images which look similar to the images from the training data). In other words, we want to sample from a complex high-dimensional distribution, which is intractable (recall VAEs).

Figure: $z \sim N(0, I)$ → complex transformation → generated sample

GANs take a different approach to this problem, where the idea is to sample from a simple tractable distribution (say, $z \sim N(0, I)$) and then learn a complex transformation from this to the training distribution. In other words, we will take a $z \sim N(0, I)$ and learn to make a series of complex transformations on it so that the output looks as if it came from our training distribution.

Mitesh M. Khapra
What can we use for such a complex transformation? A neural network.
How do you train such a neural network? Using a two-player game.
There are two players in the game: a generator and a discriminator. The job of the generator is to produce images which look so natural that the discriminator thinks that the images came from the real data distribution. The job of the discriminator is to get better and better at distinguishing between true images and generated (fake) images.

Figure: the GAN setup ($z \sim N(0, I)$ → Generator → fake image; real and fake images → Discriminator → real or fake)

So let's look at the full picture. Let $G_\phi$ be the generator and $D_\theta$ be the discriminator ($\phi$ and $\theta$ are the parameters of $G$ and $D$, respectively). We have a neural network based generator which takes as input a noise vector $z \sim N(0, I)$ and produces $G_\phi(z) = X$. We have a neural network based discriminator which could take as input a real $X$ or a generated $X = G_\phi(z)$ and classify the input as real/fake.
What should be the objective function of the overall network? Let's look at the objective function of the generator first. Given an image generated by the generator as $G_\phi(z)$, the discriminator assigns a score $D_\theta(G_\phi(z))$ to it. This score will be between 0 and 1 and will tell us the probability of the image being real or fake. For a given $z$, the generator would want to maximize $\log D_\theta(G_\phi(z))$ (log likelihood) or minimize $\log(1 - D_\theta(G_\phi(z)))$.
This is just for a single $z$, and the generator would like to do this for all possible values of $z$. For example, if $z$ were discrete and drawn from a uniform distribution (i.e., $p(z) = \frac{1}{N} \ \forall z$) then the generator's objective function would be

$$\min_{\phi} \ \frac{1}{N} \sum_{i=1}^{N} \log(1 - D_\theta(G_\phi(z)))$$

However, in our case $z$ is continuous and not uniform ($z \sim N(0, I)$), so the equivalent objective function is

$$\min_{\phi} \int p(z) \log(1 - D_\theta(G_\phi(z))) \, dz \ = \ \min_{\phi} \ \mathbb{E}_{z \sim p(z)} [\log(1 - D_\theta(G_\phi(z)))]$$


Now let's look at the discriminator. The task of the discriminator is to assign a high score to real images and a low score to fake images, and it should do this for all possible real images and all possible fake images. In other words, it should try to maximize the following objective function:

$$\max_{\theta} \ \mathbb{E}_{x \sim p_{data}} [\log D_\theta(x)] + \mathbb{E}_{z \sim p(z)} [\log(1 - D_\theta(G_\phi(z)))]$$
If we put the objectives of the generator and discriminator together, we get a minimax game:

$$\min_{\phi} \max_{\theta} \ \left[ \mathbb{E}_{x \sim p_{data}} \log D_\theta(x) + \mathbb{E}_{z \sim p(z)} \log(1 - D_\theta(G_\phi(z))) \right]$$

The first term in the objective is only w.r.t. the parameters of the discriminator ($\theta$). The second term in the objective is w.r.t. the parameters of the generator ($\phi$) as well as the discriminator ($\theta$). The discriminator wants to maximize the second term whereas the generator wants to minimize it (hence it is a two-player game). So the overall training proceeds by alternating between these two steps:
Step 1: Gradient ascent on the discriminator

$$\max_{\theta} \ \left[ \mathbb{E}_{x \sim p_{data}} \log D_\theta(x) + \mathbb{E}_{z \sim p(z)} \log(1 - D_\theta(G_\phi(z))) \right]$$

Step 2: Gradient descent on the generator

$$\min_{\phi} \ \mathbb{E}_{z \sim p(z)} \log(1 - D_\theta(G_\phi(z)))$$

In practice, the above generator objective does not work well and we use a slightly modified objective. Let us see why.
When the sample is likely fake, we want to give feedback to the generator (using gradients). However, in the region where $D(G(z))$ is close to 0, the curve of the loss function $\log(1 - D(G(z)))$ is very flat and the gradient would be close to 0.

Figure: loss as a function of $D(G(z))$ for $\log(1 - D(G(z)))$ and $-\log(D(G(z)))$.

Trick: instead of minimizing the likelihood of the discriminator being correct, maximize the likelihood of the discriminator being wrong. In effect, the objective remains the same, but the gradient signal becomes better.
With that we are now ready to see the full algorithm for training GANs (a code sketch follows below).

1: procedure GAN Training
2:   for number of training iterations do
3:     for k steps do
4:       • Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z)
5:       • Sample a minibatch of m examples {x^(1), ..., x^(m)} from the data generating distribution p_data(x)
6:       • Update the discriminator by ascending its stochastic gradient:

$$\nabla_\theta \ \frac{1}{m} \sum_{i=1}^{m} \left[ \log D_\theta\left(x^{(i)}\right) + \log\left(1 - D_\theta\left(G_\phi\left(z^{(i)}\right)\right)\right) \right]$$

7:     end for
8:     • Sample a minibatch of m noise samples {z^(1), ..., z^(m)} from the noise prior p_g(z)
9:     • Update the generator by ascending its stochastic gradient:

$$\nabla_\phi \ \frac{1}{m} \sum_{i=1}^{m} \log D_\theta\left(G_\phi\left(z^{(i)}\right)\right)$$

10:   end for
11: end procedure
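A minimal, self-contained PyTorch sketch of this loop (the networks, sizes, learning rates, and the random "real" minibatch are placeholders; the generator step uses the non-saturating objective from step 9 above):

```python
import torch
import torch.nn as nn

# Placeholder networks: 100-d noise -> 784-d sample; 784-d sample -> score
G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()
m = 64  # minibatch size

for step in range(1000):
    # Discriminator: ascend log D(x) + log(1 - D(G(z)))
    x = torch.rand(m, 784)                 # stand-in for real data
    z = torch.randn(m, 100)                # z ~ N(0, I)
    d_loss = bce(D(x), torch.ones(m, 1)) \
           + bce(D(G(z).detach()), torch.zeros(m, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: ascend log D(G(z)) by minimizing -log D(G(z))
    z = torch.randn(m, 100)
    g_loss = bce(D(G(z)), torch.ones(m, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```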

Generative Adversarial Networks - Architecture

We will now look at one of the popular architectures used for the generator and discriminator: Deep Convolutional GANs (DCGAN). For the discriminator, any CNN-based classifier with a single (real/fake) output can be used (e.g., VGG, ResNet, etc.).

Figure: Generator (Radford et al., 2015) (left) and discriminator (Yeh et al., 2016) (right) used in DCGAN
Architecture guidelines for stable Deep Convolutional GANs:

Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).

Use batchnorm in both the generator and the discriminator.

Remove fully connected hidden layers for deeper architectures.

Use ReLU activation in the generator for all layers except the output, which uses tanh.

Use LeakyReLU activation in the discriminator for all layers.
                VAEs                    GANs
Abstraction     Yes                     No
Generation      Yes                     Yes
Compute P(X)    Intractable             No
Sampling        Fast                    Fast
Loss            KL-divergence           JSD
Assumptions     X independent given z   N.A.
Samples         Ok                      Good (best)

Table: Comparison of Generative Models

Recent works look at combining these methods: e.g., Adversarial Autoencoders (Makhzani, 2015), PixelVAE (Gulrajani, 2016) and PixelGAN Autoencoders (Makhzani, 2017).

Generative Adversarial Networks - Some Cool Stuff and Applications
Image inpainting

Example of GAN-generated photographs of bedrooms. Taken from Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015.

Example of vector arithmetic for GAN-generated faces. Taken from Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015.

Examples of photorealistic GAN-generated faces. Taken from Progressive Growing of GANs for Improved Quality, Stability, and Variation, 2017.

Example of the progression in the capabilities of GANs from 2014 to 2017. Taken from The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, 2018.
