
Deep Learning

Vazgen Mikayelyan

December 12, 2020



Outline

1. Kullback-Leibler Divergence
2. Autoencoders



KL Divergence

The KL divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution:
$$\mathrm{KL}(P\|Q) = -\int_{-\infty}^{+\infty} p(x)\,\log\frac{q(x)}{p(x)}\,dx = \mathbb{E}_p\left[\log\frac{p}{q}\right],$$
where p and q are the probability density functions of the distributions P and Q.

It can be proved that for all probability distributions P and Q:
$\mathrm{KL}(P\|Q) \ge 0$,
$\mathrm{KL}(P\|Q) = 0$ if and only if $P = Q$.
However, KL is not a metric: there is no symmetry, i.e. in general $\mathrm{KL}(P\|Q) \ne \mathrm{KL}(Q\|P)$, and the triangle inequality does not hold.
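As a quick numerical illustration (not part of the original slides), here is a small Python/NumPy sketch of the discrete analogue of the definition above; the distributions p and q are arbitrary examples.

```python
import numpy as np

def kl(p, q):
    """Discrete analogue of KL(P||Q): sum_x p(x) * log(p(x) / q(x)).

    Assumes p and q are strictly positive and each sums to 1.
    """
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.4, 0.5])   # arbitrary example distribution
q = np.array([0.8, 0.1, 0.1])   # arbitrary reference distribution

print(kl(p, q))   # non-negative
print(kl(q, p))   # generally different from kl(p, q): no symmetry
print(kl(p, p))   # 0.0, since KL(P||Q) = 0 iff P = Q
```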



Jensen-Shannon Divergence

The Jensen-Shannon (JS) divergence is defined as
$$\mathrm{JS}(P\|Q) = \frac{1}{2}\,\mathrm{KL}(P\|M) + \frac{1}{2}\,\mathrm{KL}(Q\|M),$$
where $M = \dfrac{P+Q}{2}$. Unlike the KL divergence, the JS divergence is symmetric in P and Q.
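Continuing the same NumPy sketch (assumed, not from the slides), the JS divergence can be built from the KL helper; the symmetry is easy to check numerically.

```python
import numpy as np

def kl(p, q):
    # Discrete KL(P||Q); assumes strictly positive distributions that sum to 1.
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    # JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), with the mixture M = (P + Q) / 2.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])
print(js(p, q), js(q, p))   # equal: JS is symmetric
```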



KL Divergence for Gaussians

Recall that the probability density function (when it exists) of a multivariate normal distribution with mean $\mu$ and (non-singular, symmetric, positive definite) covariance matrix $\Sigma$ is
$$f(x) = \frac{\exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}}{\sqrt{(2\pi)^{k}\,|\Sigma|}}, \qquad x \in \mathbb{R}^{k}.$$

Suppose that we have two multivariate normal distributions $N_1(\mu_1,\Sigma_1)$ and $N_2(\mu_2,\Sigma_2)$. Then
$$\mathrm{KL}(N_1\|N_2) = \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^{T}\Sigma_2^{-1}(\mu_2-\mu_1) - k + \ln\frac{|\Sigma_2|}{|\Sigma_1|}\right).$$

In the one-dimensional case this reduces to
$$\mathrm{KL}(N_1\|N_2) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^{2} + (\mu_1-\mu_2)^{2}}{2\sigma_2^{2}} - \frac{1}{2}.$$
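A small NumPy sketch (assumed, not from the slides) of the closed-form expressions above; the parameter values are arbitrary, and the 1×1 case is used to check that the one-dimensional formula agrees with the matrix formula.

```python
import numpy as np

def kl_mvn(mu1, sigma1, mu2, sigma2):
    """KL(N1 || N2) for multivariate normals N1(mu1, Sigma1), N2(mu2, Sigma2)."""
    k = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff
                  - k
                  + np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1)))

def kl_1d(mu1, s1, mu2, s2):
    """One-dimensional case: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

# Both formulas agree when the covariances are 1x1 matrices.
print(kl_mvn(np.array([0.0]), np.array([[1.0]]), np.array([1.0]), np.array([[4.0]])))
print(kl_1d(0.0, 1.0, 1.0, 2.0))
```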





What is Unsupervised Learning?

Definition 1
Unsupervised learning is a machine learning technique that finds and
analyzes hidden patterns in unlabeled data.

Examples?



Autoencoders

Autoencoders are neural networks that aim to copy their inputs to their outputs. They work by compressing the input into a latent-space representation and then reconstructing the output from this representation. This kind of network is composed of two parts:
Encoder: the part of the network that compresses the input into a latent-space representation.
Decoder: the part that aims to reconstruct the input from the latent-space representation.
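To make the encoder/decoder split concrete, here is a minimal PyTorch sketch (assumed, not from the slides); the input size of 784 (e.g. flattened 28×28 images), the hidden width, and the latent dimension are illustrative choices.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into a latent-space representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Reconstruction loss on a dummy batch (illustrative only).
model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```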





What are autoencoders used for?

Anomaly/Outlier detection.
Dimensionality reduction.
Data denoising.
Many other tasks.



Types of Autoencoders

Vanilla Autoencoders
Multilayer/Deep Autoencoders
Convolutional Autoencoders
Contractive Autoencoders
Regularized Autoencoders
Sparse Autoencoders
Denoising Autoencoders
Variational Autoencoders



Vanilla Autoencoders



Multilayer/Deep Autoencoders



Convolutional Autoencoders



Contractive Autoencoders

A contractive autoencoder makes the learned encoding less sensitive to small variations in its training dataset.
This is accomplished by adding a regularizer, or penalty term, to the cost or objective function that the algorithm is trying to minimize.
The end result is to reduce the learned representation's sensitivity towards the training input.



Let f be our encoder, g be the decoder, and D be our training dataset. In the previous cases we minimized a loss function of the form
$$\sum_{x\in D} L\bigl(x, g(f(x))\bigr).$$

In the case of contractive autoencoders we minimize instead
$$\sum_{x\in D}\left( L\bigl(x, g(f(x))\bigr) + \lambda\,\|J_f(x)\|_F^2\right),$$
where the added term is the squared Frobenius norm of the Jacobian matrix of the encoder,
$$[J_f(x)]_{i,j} = \frac{\partial f_j(x)}{\partial x_i}, \qquad \|J_f(x)\|_F^2 = \sum_{i,j}\left(\frac{\partial f_j(x)}{\partial x_i}\right)^2.$$
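A PyTorch sketch (assumed, not from the slides) of this loss: the encoder Jacobian is computed with torch.autograd.functional.jacobian for clarity, although in practice the penalty is often written out analytically for a single sigmoid layer. The layer sizes and the weight λ are illustrative.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

encoder = nn.Sequential(nn.Linear(20, 5), nn.Sigmoid())   # f : R^20 -> R^5
decoder = nn.Sequential(nn.Linear(5, 20))                 # g : R^5 -> R^20
lam = 1e-3   # weight of the contractive penalty (illustrative)

def contractive_loss(x):
    # Reconstruction term L(x, g(f(x))), here mean squared error.
    recon = nn.functional.mse_loss(decoder(encoder(x)), x)
    # Penalty: squared Frobenius norm of the encoder Jacobian, per sample.
    # (The Frobenius norm is unchanged by transposition, so the index order
    # does not matter here.)
    penalty = 0.0
    for xi in x:   # simple but slow; analytic expressions are used in practice
        J = jacobian(encoder, xi, create_graph=True)   # shape (5, 20)
        penalty = penalty + (J ** 2).sum()
    return recon + lam * penalty / x.shape[0]

x = torch.rand(8, 20)
contractive_loss(x).backward()
```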





Sparse Autoencoders

Sparse autoencoders have more hidden nodes than input nodes.
They can still discover important features in the data.
A sparsity penalty is introduced on the hidden layer to prevent the output layer from simply copying the input data; this helps prevent overfitting.



Let f be our encoder, g be the decoder, and D be our training dataset with n samples. Denote
$$\rho_j = \frac{1}{n}\sum_{x\in D} f_j(x).$$

We would like to (approximately) enforce the constraint $\rho_j = \rho$, where $\rho$ is a sparsity parameter, typically a small value close to zero (say $\rho = 0.05$).
To achieve this we minimize the following loss function:
$$\sum_{x\in D} L\bigl(x, g(f(x))\bigr) + \lambda\sum_j \mathrm{KL}(\rho\|\rho_j),$$
where
$$\mathrm{KL}(\rho\|\rho_j) = -\rho\log\frac{\rho_j}{\rho} - (1-\rho)\log\frac{1-\rho_j}{1-\rho},$$
i.e. the KL divergence between Bernoulli distributions with parameters $\rho$ and $\rho_j$.
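A PyTorch sketch (assumed, not from the slides) of this penalty: the batch-average activation of each hidden unit plays the role of $\rho_j$ and is pushed toward the target $\rho$ by the Bernoulli KL term. The layer sizes, ρ = 0.05 and λ are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 64), nn.Sigmoid())   # more hidden units than inputs
decoder = nn.Sequential(nn.Linear(64, 20))
rho, lam = 0.05, 1e-3   # sparsity target and penalty weight (illustrative)

def sparse_loss(x):
    h = encoder(x)                  # activations f(x), in (0, 1) thanks to the sigmoid
    recon = nn.functional.mse_loss(decoder(h), x)
    rho_hat = h.mean(dim=0)         # rho_j: average activation of unit j over the batch
    # KL(rho || rho_j) summed over hidden units j (same quantity as on the slide,
    # written as rho*log(rho/rho_j) + (1-rho)*log((1-rho)/(1-rho_j))).
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return recon + lam * kl

x = torch.rand(8, 20)
sparse_loss(x).backward()
```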





Denoising Autoencoders

Denoising autoencoders create a corrupted copy of the input by introducing some noise.
This prevents the autoencoder from simply copying the input to the output without learning features of the data.
The model learns a vector field that maps the input data towards a lower-dimensional manifold describing the natural data, cancelling out the added noise.
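A minimal PyTorch sketch (assumed, not from the slides): Gaussian noise corrupts the input, and the reconstruction loss is taken against the clean input, so the network has to undo the corruption. The noise level and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(),   # encoder
    nn.Linear(8, 20),              # decoder
)
noise_std = 0.3   # corruption level (illustrative)

x = torch.rand(16, 20)                          # clean inputs
x_noisy = x + noise_std * torch.randn_like(x)   # corrupted copy of the input
recon = autoencoder(x_noisy)
# The target is the clean input, so the model must learn to cancel the noise.
loss = nn.functional.mse_loss(recon, x)
loss.backward()
```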

