When to Use Deep Learning?
1. Deep Learning outperforms other techniques when the data size is large. With small data sizes,
traditional Machine Learning algorithms are preferable.
2. Deep Learning techniques need high-end infrastructure to train in a reasonable time.
3. When there is a lack of domain understanding for feature introspection, Deep Learning
techniques outshine others because you have to worry less about feature engineering.
4. Deep Learning really shines when it comes to complex problems such as image
classification, natural language processing, and speech recognition.
Deep Learning Applications
1. Self-Driving Cars
2. Voice-Controlled Assistants
3. Automatic Image Caption Generation
4. Automatic Machine Translation
Classification vs. Regression
● Basic: In classification, the mapping function maps values to predefined classes; in regression, the mapping function maps values to a continuous output.
● Involves prediction of: Discrete values (classification) vs. continuous values (regression).
● Nature of the predicted data: Unordered (classification) vs. ordered (regression).
● Method of calculation: Classification is evaluated by measuring accuracy; regression by measuring root mean square error.
● Example algorithms: Classification: decision tree, logistic regression, etc. Regression: random forest, linear regression, etc.
Deep Learning
Perceptron
Note: The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision
boundary.
How Does Perceptron Work?
● Weights show the strength of a particular node.
● A bias value allows you to shift the activation function curve up or down.
1. The weights are initialized with random values at the start of training.
2. Multiply all input values by their corresponding weight values, then add them up to calculate the weighted
sum. The following is the mathematical expression of it:
a. ∑wi*xi = x1*w1 + x2*w2 + x3*w3 + … + xn*wn
3. An activation function is applied to the above-mentioned weighted sum, giving us an output in
binary form as follows:
a. Y = f(∑wi*xi + b)
4. For each element of the training set, the error is calculated as the difference between the
desired output and the actual output. The calculated error is used to adjust the weights.
5. The process is repeated until the error made on the entire training set is less than a specified
threshold, or until the maximum number of iterations has been reached.
Perceptron Algorithm Training Procedure
1. Initialize the weight vector w with small random values.
2. Until the Perceptron converges:
a. Loop over each feature vector x_j and true class label d_j in our training set D.
b. Pass x_j through the network and calculate the output value: y_j = f(w(t) · x_j).
c. Update the weights: w_i(t + 1) = w_i(t) + α(d_j − y_j)x_j,i for all features 0 ≤ i ≤ n.
A runnable sketch of this procedure follows below.
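Here is a minimal sketch of this training procedure in Python/NumPy. The step activation, learning rate, and toy AND-gate data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def step(z):
    # Hard-edge transfer function: outputs 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def train_perceptron(X, d, alpha=0.1, epochs=100):
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 so w[0] acts as the bias
    w = np.random.randn(X.shape[1]) * 0.01        # small random initial weights
    for _ in range(epochs):
        errors = 0
        for xj, dj in zip(X, d):
            yj = step(np.dot(w, xj))              # y_j = f(w(t) . x_j)
            w += alpha * (dj - yj) * xj           # w_i(t+1) = w_i(t) + alpha (d_j - y_j) x_j,i
            errors += int(dj != yj)
        if errors == 0:                           # converged on the training set
            break
    return w

# Toy linearly separable data (the AND gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```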
Activation Function of Perceptron Model
● Activation functions are used to map the input between the required values like (0, 1) or (-1, 1).
Limitations of the Perceptron Model
1. The output of a perceptron can only be a binary number (0 or 1) due to the hard-edge transfer
function.
2. It can only be used to classify linearly separable sets of input vectors. If the input vectors are
not linearly separable, a single perceptron cannot classify them correctly.
Implementing Basic Logic Gates With Perceptron
1. AND
If both inputs are TRUE (+1), the output of the Perceptron is positive, which amounts to TRUE.
2. OR
If either of the two inputs is TRUE (+1), the output of the Perceptron is positive, which amounts to TRUE. For example, x1 = 1 (TRUE), x2 = 0 (FALSE) gives a TRUE output.
3. XOR
XOR is not linearly separable, so no single perceptron can implement it; it requires a multilayer network. (See the sketch below.)
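To make the gate examples concrete, here is a small sketch with hand-picked weights and biases; the specific values are illustrative assumptions (many choices work), and the XOR comment reflects the linear-separability limitation noted above.

```python
import numpy as np

def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b >= 0 else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(x)
    and_out = perceptron(x, np.array([1.0, 1.0]), -1.5)  # fires only when both inputs are 1
    or_out  = perceptron(x, np.array([1.0, 1.0]), -0.5)  # fires when either input is 1
    print(tuple(x), "AND:", and_out, "OR:", or_out)

# XOR: no single (w, b) produces outputs (0, 1, 1, 0) for these four inputs,
# because the two classes are not linearly separable.
```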
Deep Learning
Multilayer Perceptron
1. Feedforward:
In the feedforward pass of a neural network, we have a set of input features and some random weights. Notice
that in this case we start with random weights, which we will then optimize using backpropagation.
2. Backpropagation:
Backpropagation is an algorithm for updating the weights and biases of a model based on their gradients
with respect to the error function, starting from the output layer and working back to the first layer.
● Gradient:
○ A gradient measures how much the output of a function changes if you change the inputs a
little bit.
○ In machine learning, a gradient is the derivative of a function, also known as the slope of a
function in mathematical terms.
● Gradient Descent
○ Gradient Descent is an optimization algorithm for finding a local minimum of a
differentiable function.
○ The main objective of using a gradient descent algorithm is to iteratively minimize the cost
function.
● The cost function measures the difference, or error, between the actual values
and the predicted values.
How Does Gradient Descent Work?
● The learning rate is defined as the step size taken at each iteration to reach the minimum, or lowest point, of the cost function.
● It is typically a small value that is tuned based on the behavior of the cost function.
● If the learning rate is high, training takes larger steps, but risks overshooting the
minimum.
● A low learning rate takes small steps, which compromises overall efficiency (more iterations) but gives the
advantage of more precision.
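A minimal sketch of these ideas on a one-parameter cost J(w) = (w - 3)^2; the cost function and the learning-rate values are illustrative assumptions.

```python
def gradient_descent(lr=0.1, steps=50):
    w = 0.0                 # arbitrary starting point
    for _ in range(steps):
        grad = 2 * (w - 3)  # dJ/dw for J(w) = (w - 3)^2, minimum at w = 3
        w -= lr * grad      # step of size lr in the direction of steepest descent
    return w

print(gradient_descent(lr=0.1))  # small, precise steps: close to 3.0
print(gradient_descent(lr=0.9))  # large oscillating steps: still converges here
# With lr = 1.1 the updates overshoot the minimum and diverge.
```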
Deep Learning
Backpropagation: A Worked Example
● To find the value of H1, we multiply the inputs by the corresponding weights and add the bias:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
● Next, we apply the sigmoid activation to H1 and H2 to obtain the hidden outputs: out_H1 = σ(0.3775) = 0.593269992 and out_H2 = σ(0.3925) = 0.596884378.
● Now we calculate the values of y1 and y2 from these activated outputs in the same way:
y1 = out_H1×w5 + out_H2×w6 + b2
y1 = 0.593269992×0.40 + 0.596884378×0.45 + 0.60
y1 = 1.10590597
● Now Y1 (final) = σ(y1) = σ(1.10590597) = 0.75136507.
● Total Error: E_total = ∑ ½(target − output)², summed over the output neurons.
● Now, we will backpropagate this error to update the weights using a backward pass.
Backward pass at the output layer
● To update a weight, we calculate the error corresponding to that weight with the help of the total
error.
● The error on weight w is calculated by differentiating the total error with respect to w.
● To differentiate E_total with respect to w5, we apply the chain rule and calculate each term one by one:
∂E_total/∂w5 = (∂E_total/∂out_y1) × (∂out_y1/∂y1) × (∂y1/∂w5)
● We then substitute the three computed values into this equation to find the final result.
● Finally, we calculate the updated weight with the following formula: w5(new) = w5 − η × (∂E_total/∂w5), where η is the learning rate.
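The whole forward and backward pass for w5 can be checked numerically. The sketch below uses the weights and inputs from the example above; the target value and the learning rate are assumptions for illustration, since they are not shown in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4, w5, w6 = 0.15, 0.20, 0.25, 0.30, 0.40, 0.45
b1, b2 = 0.35, 0.60

# Forward pass
out_H1 = sigmoid(x1 * w1 + x2 * w2 + b1)   # sigmoid(0.3775) = 0.593269992
out_H2 = sigmoid(x1 * w3 + x2 * w4 + b1)   # sigmoid(0.3925) = 0.596884378
y1 = out_H1 * w5 + out_H2 * w6 + b2        # 1.10590597
out_y1 = sigmoid(y1)                       # Y1 (final) = 0.75136507

# Backward pass for w5, with E = 1/2 (target - out_y1)^2
target, eta = 0.01, 0.5                    # assumed target and learning rate
dE_dout   = out_y1 - target                # dE_total / d out_y1
dout_dnet = out_y1 * (1 - out_y1)          # derivative of the sigmoid
dnet_dw5  = out_H1                         # d y1 / d w5
w5_new = w5 - eta * dE_dout * dout_dnet * dnet_dw5
print(out_y1, w5_new)
```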
Deep Learning
Activation Functions
● Why can’t we switch this signal to the output without activating it?
○ Without an activation function, the output signal would be a simple linear function of the input, and a stack of linear layers can only ever represent a linear mapping.
1. Step Function
2. Sigmoid Function
3. Tanh Function
4. ReLU Function
5. Leaky ReLU Function
6. Softmax Function
Step Function
● The step function outputs 1 if the input is above a threshold and 0 otherwise; it is the hard-edge transfer function used by the perceptron.
Sigmoid Function
Advantages:
● The output value is between 0 and 1.
● The prediction is simple, i.e., based on a threshold probability value.
Disadvantages:
● Computationally expensive
● Outputs not zero centered
● Vanishing gradient—for very high or very low values of X, there is almost no change to the
prediction, causing a vanishing gradient problem. This can result in the network refusing to learn
further, or being too slow to reach an accurate prediction.
Hyperbolic Tangent (tanh) Function
● The tanh function is similar to the sigmoid function. The output ranges from -1 to 1.
● The mathematical form of the tanh function is:
tanh(x) = (e^x − e^−x) / (e^x + e^−x)
Advantages:
● Zero Centered
● The prediction is simple, i.e., based on a threshold probability value.
Disadvantages:
● Like the sigmoid, tanh saturates for very large or very small inputs, which leads to the vanishing gradient problem.
Rectified Linear Unit (ReLU) Function
● ReLU outputs the input directly if it is positive and 0 otherwise: f(x) = max(0, x).
Advantages:
● No gradient vanishing for positive inputs
● Derivative is constant (1) for positive inputs
● Computationally efficient
Disadvantages:
● Dying ReLU problem: for inputs that are 0 or negative, the output and gradient of ReLU are zero, so the network cannot continue backpropagation through those neurons and they may stop learning. To get rid of this problem we use an improved version of ReLU, called Leaky ReLU.
Leaky Rectified Linear Unit (Leaky ReLU) Function
● Leaky ReLU is the most common and effective method to solve a dying ReLU
problem.
● It is nothing but an improved version of the ReLU function.
● It adds a slight slope in the negative range to prevent the dying ReLU issue.
● The mathematical representation of Leaky ReLU is: f(x) = x if x > 0, and f(x) = αx otherwise, where α is a small constant (e.g., 0.01).
Disadvantages:
● Leaky ReLU does not provide consistent predictions for negative input values.
Softmax Function
● Softmax converts a vector of raw scores into a probability distribution: softmax(x_i) = e^(x_i) / ∑_j e^(x_j). It is typically used in the output layer for multi-class classification.
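The activation functions listed above can be sketched in a few lines of NumPy; the test vector is an illustrative assumption.

```python
import numpy as np

def step(z):       return np.where(z >= 0, 1, 0)
def sigmoid(z):    return 1 / (1 + np.exp(-z))
def relu(z):       return np.maximum(0, z)
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)      # small slope alpha in the negative range
def softmax(z):
    e = np.exp(z - np.max(z))                 # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(np.tanh(z))          # zero-centered, range (-1, 1)
print(relu(z))             # negatives clamped to 0 (the source of "dying ReLU")
print(leaky_relu(z))       # negatives keep a small, nonzero gradient
print(softmax(z).sum())    # softmax outputs sum to 1 (a probability distribution)
```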
Loss and Cost Functions
● A loss/cost function in machine learning helps us understand the difference between the predicted
value and the actual value.
● The loss function is associated with every individual training example, while the cost function is the
average value of the loss function over all the training samples.
● In machine learning, we usually optimize the cost function rather than the loss function.
Types of Cost Functions
Cost functions can be of various types depending on the problem. However, they are mainly of two types:
a. Regression cost functions
b. Classification cost functions
Regression Cost Functions
● Regression models deal with predicting a continuous value, for example the salary of an employee,
the price of a car, a loan amount, etc.
● Their cost is calculated from a distance-based error:
Error = y − y’
where y is the actual value and y’ is the predicted output.
● The most used regression cost functions are:
a. Mean Squared Error
b. Mean Absolute Error
Mean Squared Error (MSE)
● MSE is measured as the average of the squared differences between predictions and actual
observations:
MSE = (1/n) ∑_i (y_i − y’_i)²
● Here a square of the difference between the actual and predicted value is calculated to avoid
any possibility of negative error.
● It is also known as L2 loss.
● In MSE, since each error is squared, it helps to penalize even small deviations in prediction
when compared to MAE.
● But if our dataset has outliers that contribute to larger prediction errors, then squaring this
error further will magnify the error many times more and also lead to higher MSE error.
● Hence we can say that it is less robust to outliers.
[Figure: regression fits and MSE (a) without an outlier and (b) with an outlier]
Mean Absolute Error (MAE)
● MAE is measured as the average of the absolute differences between predictions and actual
observations:
MAE = (1/n) ∑_i |y_i − y’_i|
● Here an absolute difference between the actual and predicted value is calculated to avoid any
possibility of negative error.
● It is also known as L1 Loss.
● It is robust to outliers thus it will give better results even when our dataset has noise or
outliers.
[Figure: regression fits and MAE (a) without an outlier and (b) with an outlier]
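The effect of an outlier on MSE versus MAE can be checked directly; the small dataset below is an illustrative assumption.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)    # L2 loss: squares each error

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))   # L1 loss: absolute errors

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 0.25 and 0.5

# One outlier with error 10 dominates MSE but barely moves MAE:
y_true = np.append(y_true, 20.0)
y_pred = np.append(y_pred, 10.0)
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 20.2 and 2.4
```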
Classification Cost Functions
● The most used classification cost functions are below,
a. Cross Entropy Loss
b. KL Divergence Loss
c. Hinge Loss
Cross Entropy Loss
● Entropy:
○ Entropy signifies uncertainty: the greater the value of the entropy H(X), the greater the
uncertainty of the probability distribution, and the smaller the value, the lower the uncertainty.
○ For a random variable X with probability distribution p(X), entropy is defined as:
H(X) = −∑_x p(x) log p(x)
Cross Entropy Loss
● Cross Entropy:
○ Also called logarithmic loss, log loss or logistic loss.
○ Each predicted class probability is compared to the actual class desired output 0 or 1 and a
loss is calculated that penalizes the probability based on how far it is from the actual
expected value.
○ The penalty is logarithmic in nature, yielding a large loss when the predicted probability is
far from the true label and a small loss when it is close.
○ A perfect model has a cross-entropy loss of 0.
○ For a true distribution y and predicted distribution ŷ over n classes, cross-entropy is defined as:
L = −∑_i y_i log(ŷ_i)
Cross Entropy Loss
● Example:
○ For instance, if the true class distribution is [1, 0, 0] and the model predicts [0.7, 0.2, 0.1], the cross-entropy is −log 0.7 ≈ 0.357.
KL Divergence Loss
● The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much one
probability distribution differs from another:
KL(P ∥ Q) = ∑_x p(x) log(p(x) / q(x))
● The lower the KL divergence value, the better we have matched the true distribution with our
approximation.
● The KL divergence is not symmetric: in general, KL(P ∥ Q) ≠ KL(Q ∥ P).
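Both losses are easy to compute directly; the distributions below are illustrative assumptions.

```python
import numpy as np

def cross_entropy(p_true, q_pred):
    # -sum p(x) log q(x); with a one-hot p_true this reduces to -log q[true class]
    return -np.sum(p_true * np.log(q_pred))

def kl_divergence(p, q):
    # KL(P || Q) = sum p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

q = np.array([0.7, 0.2, 0.1])                        # predicted probabilities
print(cross_entropy(np.array([1.0, 0.0, 0.0]), q))   # -log 0.7 ~ 0.357

p = np.array([0.6, 0.3, 0.1])                        # a non-degenerate "true" distribution
print(kl_divergence(p, q), kl_divergence(q, p))      # the two values differ: KL is not symmetric
```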
Train, Validation, and Test Sets
● Some models need substantial data to train upon, so in this case you would
optimize for the larger training sets.
● Models with very few hyperparameters will be easy to validate and tune, so you can
probably reduce the size of your validation set, but if your model has many
hyperparameters, you would want to have a large validation set as well.
● Also, if you have a model with no hyperparameters, or ones that cannot be easily
tuned, you probably don’t need a validation set at all.
Bias vs Variance trade-off
● Bias is the error from overly simple assumptions in the model: a high-bias model underfits and misses relevant relations between features and outputs.
● Variance is the error from sensitivity to small fluctuations in the training set: a high-variance model overfits and models noise rather than the underlying function f.
● As model complexity increases, bias decreases but variance increases; the goal is to choose a complexity that balances the two.
● As the model complexity increases, the training error becomes overly optimistic and gives us a
wrong picture of how close f̂ is to f.
● The validation error gives the real picture of how close f̂ is to f.
Deep Learning
Data Augmentation
● Generative adversarial networks (GANs): GAN algorithms can learn patterns from input datasets
and automatically create new examples which resemble training data.
● Neural style transfer: Neural style transfer models can blend content image and style image and
separate style from content.
● Reinforcement learning: Reinforcement learning models train software agents to attain their
goals and make decisions in a virtual environment.
What are the benefits of data augmentation?
● It enlarges and diversifies the training set without collecting new data, which helps reduce overfitting and improves model robustness.
Early Stopping
● A significant challenge when training a machine learning model is deciding how many epochs to
run: too few epochs might not lead to model convergence, while too many epochs could lead to
overfitting.
● Early stopping is an optimization technique used to reduce overfitting without compromising
model accuracy. The main idea behind early stopping is to stop training before the model starts to
overfit.
Early Stopping Approaches
● A common approach is to monitor the loss on a held-out validation set and stop training when it has not improved for a fixed number of epochs (the "patience"), as sketched below.
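A minimal patience-based early-stopping loop; the train_one_epoch and validate callables are placeholders you would supply for your own model and data.

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=1000, patience=5):
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_val_loss:
            best_val_loss = val_loss       # validation improved: reset the patience counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                      # stop before the model starts to overfit
    return best_val_loss
```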
Cross-Validation
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and
testing.
Model Overfitting
● When a model overfits, it performs well on the training data but poorly on unseen data; regularization techniques such as L1 and L2 penalties constrain the model's weights to reduce overfitting.
L1 Regularization
● Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the
“absolute value of magnitude” of the coefficients as a penalty term to the loss
function: Loss + λ ∑ |w_i|.
● Lasso shrinks the less important features’ coefficients to zero, thus removing some
features altogether.
● So, this works well for feature selection in case we have a huge number of
features.
● L1 regularization is robust in dealing with outliers.
L2 Regularization
● The regression model that uses L2 regularization is called Ridge Regression; it adds the “squared magnitude” of the coefficients as a penalty term to the loss function: Loss + λ ∑ w_i².
● Unlike Lasso, Ridge shrinks coefficients toward zero but does not set them exactly to zero, so it does not perform feature selection.
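A quick way to see the difference is to fit Lasso and Ridge on synthetic data where only two features matter; the data and alpha values are illustrative assumptions (this sketch uses scikit-learn's Lasso/Ridge estimators).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 useful features

print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1: most coefficients exactly 0 (feature selection)
print(Ridge(alpha=1.0).fit(X, y).coef_)  # L2: coefficients shrunk toward, but not to, 0
```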
Ensemble Methods
● Training Data: Vary the choice of data used to train each model in the ensemble.
● Ensemble Models: Vary the choice of the models used in the ensemble.
● Combinations: Vary the choice of the way that outcomes from ensemble members are
combined.
Bagging Approach
● It is a type of ensemble method.
● This approach is called bootstrap aggregation, and was designed for use with decision
trees that have high variance and low bias.
● Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples, selecting
observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models (see the sketch below).
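The sketch referenced in step 4, using scikit-learn's BaggingClassifier with decision trees as the high-variance base model; the dataset and hyperparameters are illustrative assumptions, and the estimator argument follows recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # a base model is built on each subset
    n_estimators=50,                     # number of bootstrap subsets / models
    bootstrap=True,                      # sample observations with replacement
    random_state=0,
)
print(cross_val_score(bagging, X, y, cv=5).mean())  # predictions combined by voting
```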
Advantages of Ensemble Methods
● Combining several models typically reduces variance and yields better generalization than any single constituent model.
● It also makes predictions more robust to the quirks of any single training run.
Optimization Algorithms
● Based on how many training examples are used to compute the error for each parameter update, the Gradient Descent learning algorithm
can be divided into:
a. Batch gradient descent,
b. Stochastic gradient descent, and
c. Mini-batch gradient descent.
Batch GD Optimization Algorithm
● This is a type of gradient descent which processes all the training examples for each
iteration of gradient descent.
● But if the number of training examples is large, then batch gradient descent is
computationally very expensive.
● Hence if the number of training examples is large, then batch gradient descent is
not preferred. Instead, we prefer to use stochastic gradient descent or mini-batch
gradient descent.
Stochastic GD Optimization Algorithm
● This is a type of gradient descent which processes 1 training example per iteration.
● Hence, the parameters are being updated even after one iteration in which only a
single example has been processed.
● Hence, this is quite faster than batch gradient descent.
● But when the number of training examples is large, processing only one example per
update becomes an overhead for the system, as the number of iterations will be quite large.
Mini-Batch GD Optimization Algorithm
● This is a type of gradient descent which works faster than both batch gradient
descent and stochastic gradient descent.
● Here b examples where b<m are processed per iteration.
● So even if the number of training examples is large, it is processed in batches of b
training examples in one go.
● Thus, it works for large training sets, and with a smaller number of
iterations.
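All three variants share one update rule and differ only in how many examples feed each gradient computation; this sketch for linear regression makes the difference explicit (the data interface is an illustrative assumption).

```python
import numpy as np

def grad(w, Xb, yb):
    # Gradient of the mean squared error for linear regression on a batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def gradient_descent(X, y, lr=0.01, epochs=100, batch_size=None):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        if batch_size is None:                 # batch GD: all m examples per update
            w -= lr * grad(w, X, y)
        else:                                  # SGD (batch_size=1) or mini-batch (1 < b < m)
            idx = np.random.permutation(m)
            for s in range(0, m, batch_size):
                b = idx[s:s + batch_size]
                w -= lr * grad(w, X[b], y[b])
    return w
```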
Convergence trends in different variants of GD:
● In case of Batch GD, the algorithm follows a straight path towards the global
minimum. Here the learning rate is typically held constant.
● In case of stochastic GD and mini-batch GD, the algorithm does not converge but
keeps on fluctuating around the global minimum.
● Therefore in order to make it converge, we have to slowly change the learning rate.
● However the convergence of Stochastic gradient descent is much noisier as in one
iteration, it processes only one training example.
GD with Momentum
● Momentum accelerates gradient descent by accumulating an exponentially decaying moving average of past gradients: v(t) = γ·v(t−1) + η·∇J(θ), followed by θ = θ − v(t), where γ (commonly around 0.9) is the momentum coefficient.
● This damps the oscillations seen in stochastic and mini-batch GD and speeds up progress along directions of consistent gradient.
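A minimal sketch of the momentum update on the same kind of one-parameter cost used earlier; the gradient function and hyperparameters are illustrative assumptions.

```python
def gd_momentum(grad_fn, w0=0.0, lr=0.01, gamma=0.9, steps=200):
    w, v = w0, 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad_fn(w)  # decaying moving average of past gradients
        w = w - v
    return w

print(gd_momentum(lambda w: 2 * (w - 3)))  # minimizes (w - 3)^2, approaches 3.0
```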
Convolutional Neural Networks (CNN)
Let us do a few exercises: given an input volume of size H1 × W1 × D1 and K filters of size F × F applied with stride S and zero-padding P, the output volume is
H2 = (H1 − F + 2P)/S + 1, W2 = (W1 − F + 2P)/S + 1, D2 = K.
● Example: a 227 × 227 × 3 input convolved with 96 filters of size 11 × 11 at stride 4 with no padding gives H2 = 55, W2 = 55, D2 = 96.
● Exercise: compute H2, W2, D2 for other input/filter/stride combinations (see the sketch below).
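The output-size formula is easy to wrap in a helper; the example call reproduces the worked example above.

```python
def conv_output_shape(H1, W1, D1, F, K, S, P=0):
    # K filters of size F x F x D1, stride S, zero-padding P
    H2 = (H1 - F + 2 * P) // S + 1
    W2 = (W1 - F + 2 * P) // S + 1
    return H2, W2, K

print(conv_output_shape(227, 227, 3, F=11, K=96, S=4))  # (55, 55, 96)
```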
Deep Learning
CNN: Counting Parameters
● [Table partially lost in extraction: per-layer trainable-parameter counts for a LeNet-style CNN, including 150, 0, 2400, 0, 48120, 10164, and 2210.]
● Convolution layers contribute (F × F × D_in) weights per filter (plus biases); pooling layers contribute 0 trainable parameters.
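Assuming the counts above come from a LeNet-style network (an assumption, since the layer table was lost), most of them can be reproduced with two small helpers:

```python
def conv_params(F, D_in, K, bias=False):
    # Each of the K filters has F*F*D_in weights (+1 bias if used)
    return K * (F * F * D_in + (1 if bias else 0))

def dense_params(n_in, n_out, bias=True):
    # Fully connected layer: n_out neurons, each with n_in weights (+1 bias)
    return n_out * (n_in + (1 if bias else 0))

print(conv_params(5, 1, 6))      # 150: six 5x5x1 filters
print(conv_params(5, 6, 16))     # 2400: sixteen 5x5x6 filters
print(dense_params(400, 120))    # 48120
print(dense_params(120, 84))     # 10164
# Pooling layers contribute 0 trainable parameters.
```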
Transfer Learning
The reuse of a pre-trained model on a new problem is known as transfer learning
in machine learning.
In transfer learning, a machine uses the knowledge learned from a prior task to improve
prediction on a new task.
The knowledge of an already trained machine learning model is transferred to a
different but closely related problem.
For example, if you trained a simple classifier to predict whether an image
contains a backpack, you could use the model’s training knowledge to identify
other objects such as sunglasses.
How Transfer Learning Works?
In computer vision, neural networks typically aim to detect edges in the first
layer, forms in the middle layer, and task-specific features in the latter layers.
In transfer learning, the early and central layers are reused, and only the later
layers are retrained. The model thus makes use of the labelled data from the task it
was originally trained on.
Consider the example of a model trained to identify a backpack in an
image, which will now be used to detect sunglasses. Because the model has learned
to recognise generic objects in its earlier layers, we simply retrain the later
layers so it learns what distinguishes sunglasses from other objects (a sketch follows below).
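A short sketch of this retraining pattern using Keras (assuming TensorFlow is installed; the base network, input size, and single sunglasses/not-sunglasses output are illustrative assumptions):

```python
import tensorflow as tf

# Pre-trained base: its early/central layers already recognize generic objects
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the reused layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new task-specific head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # supply your own datasets
```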
Uses of Transfer Learning
Transfer learning offers a number of advantages, the most important of which are
1) When we don’t have enough annotated data to train our model with.
2) When there is a pre-trained model that has been trained on similar data and tasks.
Deep Learning
Introduction to Autoencoders
● An autoencoder consists of an encoder, which maps the input x_i to a hidden representation
h = g(W x_i + b)
and a decoder, which reconstructs the input from h:
x̂_i = f(W* h + c)
● Here x_i ∈ R^n, h ∈ R^d, W and W* are the encoder and decoder weight matrices, and b, c are the biases.
● Case 1: binary inputs (e.g., x_i = 0 1 1 0 1). The decoder uses the logistic function,
x̂_i = logistic(W* h + c),
since it naturally restricts each output to (0, 1).
● Case 2: real-valued inputs (e.g., x_i = 0.25 0.5 1.25 3.5 4.5). What will logistic and tanh do? They will restrict the reconstructed x̂_i to lie in [0, 1] or [−1, 1], whereas we want x̂_i ∈ R^n. So the decoder f is chosen to be linear, while g is again typically chosen as the sigmoid function.
● For real-valued inputs we minimize the squared-error loss:
min over W, W*, c, b of (1/m) ∑_{i=1}^m ∑_{j=1}^n (x̂_ij − x_ij)², i.e., min (1/m) ∑_{i=1}^m (x̂_i − x_i)ᵀ(x̂_i − x_i)
● For binary inputs we minimize the cross-entropy loss:
L(θ) = −∑_{j=1}^n (x_ij log x̂_ij + (1 − x_ij) log(1 − x̂_ij))
What if x_ij = 1? What if x_ij = 0? Indeed, the above function is minimized when x̂_ij = x_ij.
● Backpropagation: viewing the autoencoder as a two-layer network with h_0 = x_i, pre-activations a_1, a_2, and h_2 = x̂_i, the gradients follow the chain rule:
∂L(θ)/∂W* = (∂L(θ)/∂h_2) (∂h_2/∂a_2) (∂a_2/∂W*)
∂L(θ)/∂W = (∂L(θ)/∂h_2) (∂h_2/∂a_2) (∂a_2/∂h_1) (∂h_1/∂a_1) (∂a_1/∂W)
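A tiny NumPy autoencoder for real-valued inputs, matching the setup above (sigmoid encoder g, linear decoder f, squared-error loss); the dimensions and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def train_autoencoder(X, d=2, lr=0.1, epochs=1000):
    m, n = X.shape
    rng = np.random.default_rng(0)
    W  = rng.normal(scale=0.1, size=(d, n)); b = np.zeros(d)  # encoder parameters
    Ws = rng.normal(scale=0.1, size=(n, d)); c = np.zeros(n)  # decoder parameters (W*)
    for _ in range(epochs):
        H = sigmoid(X @ W.T + b)        # h = g(W x + b)
        X_hat = H @ Ws.T + c            # x_hat = f(W* h + c), with f linear
        E = X_hat - X                   # gradient of (1/2) squared error wrt X_hat
        dWs = E.T @ H / m; dc = E.mean(axis=0)
        dH = (E @ Ws) * H * (1 - H)     # backprop through the decoder and the sigmoid
        dW = dH.T @ X / m; db = dH.mean(axis=0)
        W -= lr * dW; b -= lr * db; Ws -= lr * dWs; c -= lr * dc
    return W, b, Ws, c
```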
Denoising Autoencoders
● A denoising autoencoder corrupts the input x_i into x̃_i using a probabilistic process P(x̃_ij | x_ij), and is trained to reconstruct the clean x_i from the corrupted x̃_i.
● A common corruption process sets each component to 0 with probability q and keeps it unchanged otherwise:
P(x̃_ij = 0 | x_ij) = q
P(x̃_ij = x_ij | x_ij) = 1 − q
● For MNIST, the input is an image with |x_i| = 784 = 28 × 28 pixels, and the hidden representation h ∈ R^d (with d < 784) acts as a compressed encoding.
● Which input maximally activates a given hidden neuron? For neuron 1 with weight vector W_1 we solve
max over x_i of {W_1ᵀ x_i} s.t. ||x_i||² = x_iᵀ x_i = 1
(the input is normalized so that ||x_i|| = 1).
● Solution: x_i = W_1 / √(W_1ᵀ W_1)
● In other words, a neuron's normalized weight vector is the input that maximally activates it, so the rows of W can be visualized as the filters learned by the autoencoder.
Sparse Autoencoders
● A sparse autoencoder adds a penalty Ω(θ) that keeps the average activation of each hidden neuron,
ρ̂_l = (1/m) ∑_{i=1}^m g(W_{:,l}ᵀ x_i + b_l),
close to a small target value ρ (e.g., ρ = 0.2).
● For each element in the penalty we can calculate ∂ρ̂_l/∂W (the partial derivative of a scalar w.r.t. a matrix, which is a matrix). For a single element W_jl of the matrix W:
∂ρ̂_l/∂W_jl = ∂[(1/m) ∑_{i=1}^m g(W_{:,l}ᵀ x_i + b_l)] / ∂W_jl
= (1/m) ∑_{i=1}^m ∂[g(W_{:,l}ᵀ x_i + b_l)] / ∂W_jl
= (1/m) ∑_{i=1}^m g′(W_{:,l}ᵀ x_i + b_l) · x_ij
Summary of autoencoder regularizers:
● Weight decay: Ω(θ) = λ‖θ‖²
● Sparse: Ω(θ) = ∑_{l=1}^k [ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l))]
● Contractive: Ω(θ) = ‖J_x(h)‖²_F = ∑_{j=1}^n ∑_{l=1}^k (∂h_l/∂x_j)²
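The sparse penalty is straightforward to compute from a batch of hidden activations; ρ = 0.2 follows the target value above.

```python
import numpy as np

def sparsity_penalty(H, rho=0.2):
    # H is an (m x k) matrix of hidden activations; rho_hat is the average
    # activation of each hidden neuron, pushed toward the target rho
    rho_hat = H.mean(axis=0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```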
Variational Autoencoder
AE vs VAE
● Autoencoder (AE): the input to the decoder is the deterministic hidden representation produced by the encoder.
● Variational Autoencoder (VAE): the input to the decoder is stochastic, sampled from a Gaussian whose mean and variance are output by the encoder; the loss adds a KL-divergence term that keeps this distribution close to the prior.
Generative Adversarial Networks (GANs)
Outline: What are GANs? · Intuition behind GANs · GANs Architecture · Training Procedure · Why GANs?
What are GANs?
Figure 1. The generator tries to generate fake images while taking random noise as input and the
discriminator tries to classify it as real or fake.
https://learnopencv.com/introduction-to-generative-adversarial-networks/#generator
Intuition behind GANs?
Figure 2. The counterfeiter trying to generate fake money using a feedback mechanism from the police.
https://learnopencv.com/introduction-to-generative-adversarial-networks/#generator
GANs Architecture
Figure 3. Block diagram of GANs. (Z is some random noise (Gaussian/Uniform); Z can be thought of as the
latent representation of the image.)
https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
Training Discriminator
https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
Training Generator
https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
GAN’s Formulation
min_G max_D V(D, G)
Why GANs? Applications:
● Text-to-Image Synthesis
● Data Augmentation
● Low-Resolution to High-Resolution
● Voice Translation
As usual we are given some training data (say, MNIST images) which obviously
comes from some underlying distribution
Our goal is to generate more images from this distribution (i.e., create images
which look similar to the images from the training data)
In other words, we want to sample from a complex, high-dimensional distribution
which is intractable (recall VAE models).
GANs take a different approach to this problem where the idea is to sample
from a simple tractable distribution (say, z ∼ N (0, I)) and then learn a complex
transformation from this to the training distribution
In other words, we will take a z ∼ N (0, I), learn to make a series of complex
transformations on it so that the output looks as if it came from our training
distribution
What can we use for such a complex transformation? A Neural Network.
How do you train such a neural network? Using a two-player game.
● There are two players in the game: a generator and a discriminator.
● The job of the generator is to produce images which look so natural that the discriminator thinks the images came from the real data distribution.
● The job of the discriminator is to get better and better at distinguishing between true images and generated (fake) images.
So let’s look at the full picture.
● Let G_φ be the generator and D_θ be the discriminator (φ and θ are the parameters of G and D, respectively).
● We have a neural-network-based generator which takes as input a noise vector z ∼ N(0, I) and produces G_φ(z) = X.
● We have a neural-network-based discriminator which can take as input a real X or a generated X = G_φ(z) and classify the input as real/fake.
● What should be the objective function of the overall network?
● Let’s look at the objective function of the generator first.
● Given an image generated by the generator as G_φ(z), the discriminator assigns it a score D_θ(G_φ(z)).
● This score will be between 0 and 1 and tells us the probability of the image being real or fake.
● For a given z, the generator wants to maximize log D_θ(G_φ(z)) (log likelihood) or, equivalently, minimize log(1 − D_θ(G_φ(z))).
● This is just for a single z, and the generator would like to do this for all possible values of z. For example, if z were discrete and drawn from a uniform distribution (i.e., p(z) = 1/N ∀ z), the generator’s objective function would be
min_φ (1/N) ∑_{i=1}^N log(1 − D_θ(G_φ(z)))
● However, in our case z is continuous and not uniform (z ∼ N(0, I)), so the equivalent objective function is
min_φ ∫ p(z) log(1 − D_θ(G_φ(z))) dz
● Now let’s look at the discriminator.
● The task of the discriminator is to assign a high score to real images and a low score to fake images, and it should do this for all possible real images and all possible fake images.
● In other words, it should try to maximize the following objective function:
max_θ E_{x∼p_data}[log D_θ(x)] + E_{z∼p(z)}[log(1 − D_θ(G_φ(z)))]
● If we put the objectives of the generator and the discriminator together, we get a minimax game:
min_φ max_θ [E_{x∼p_data} log D_θ(x) + E_{z∼p(z)} log(1 − D_θ(G_φ(z)))]
[Figure: loss curves of log(1 − D(G(z))) and −log(D(G(z))) as functions of D(G(z)) ∈ [0, 1]]
● When the sample is likely fake (D(G(z)) close to 0), we want to give feedback to the generator using gradients.
● However, in the region where D(G(z)) is close to 0, the curve of log(1 − D(G(z))) is very flat and the gradient is close to 0.
● The curve of −log(D(G(z))) is steep in exactly this region, which is why, in practice, the generator is trained by ascending log D(G(z)) instead (see the algorithm below).
GAN training algorithm (one iteration):
● Sample a minibatch of m noise samples {z(1), …, z(m)} from the noise prior p_g(z) and a minibatch of m examples {x(1), …, x(m)} from the data distribution, then update the discriminator by ascending its stochastic gradient:
∇_θ (1/m) ∑_{i=1}^m [log D_θ(x(i)) + log(1 − D_θ(G_φ(z(i))))]
● Sample a minibatch of m noise samples {z(1), …, z(m)} from the noise prior p_g(z).
● Update the generator by ascending its stochastic gradient:
∇_φ (1/m) ∑_{i=1}^m log D_θ(G_φ(z(i)))
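A minimal PyTorch sketch of this alternating update (the generator/discriminator architectures, the data loader, and the hyperparameters are illustrative assumptions; D is assumed to end in a sigmoid so its output is a probability):

```python
import torch
import torch.nn as nn

def train_gan(G, D, real_loader, z_dim=100, epochs=10, lr=2e-4):
    bce = nn.BCELoss()
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(epochs):
        for x_real in real_loader:
            m = x_real.size(0)
            z = torch.randn(m, z_dim)                 # z ~ N(0, I)
            x_fake = G(z)
            # Discriminator: ascend log D(x) + log(1 - D(G(z)))
            d_loss = bce(D(x_real), torch.ones(m, 1)) + \
                     bce(D(x_fake.detach()), torch.zeros(m, 1))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # Generator: ascend log D(G(z)) (the non-saturating objective)
            g_loss = bce(D(x_fake), torch.ones(m, 1))
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```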
Generative Adversarial Networks - Architecture
● We will now look at one of the popular architectures used for the generator
and discriminator: Deep Convolutional GANs (DCGAN).
● For the discriminator, any CNN-based binary classifier with a single (real/fake) output
can be used (e.g., VGG, ResNet, etc.).
Figure: Generator (Radford et al., 2015) (left) and discriminator (Yeh et al., 2016) (right)
used in DCGAN.
Architecture guidelines for stable Deep Convolutional GANs
Replace any pooling layers with strided convolutions (discriminator) and
fractional-strided convolutions (generator).
Use batchnorm in both the generator and the discriminator.
Remove fully connected hidden layers for deeper architectures.
Use ReLU activation in generator for all layers except for the output, which
uses tanh.
Use LeakyReLU activation in the discriminator for all layers
VAEs vs. GANs:
● Abstraction: VAEs yes, GANs no
● Generation: VAEs yes, GANs yes
● Compute P(X): VAEs intractable, GANs no
● Sampling: VAEs fast, GANs fast