
Deep Learning

Vazgen Mikayelyan

December 12, 2020



Outline

1. Kullback-Leibler Divergence
2. Autoencoders



KL Divergence

The KL divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution:
$$\mathrm{KL}(P\|Q) = -\int_{-\infty}^{+\infty} p(x)\,\log\frac{q(x)}{p(x)}\,dx = \mathbb{E}_p\left[\log\frac{p}{q}\right],$$
where p and q are the probability density functions of the distributions P and Q.

It can be proved that for all probability distributions P and Q:
$\mathrm{KL}(P\|Q) \ge 0$,
$\mathrm{KL}(P\|Q) = 0$ if and only if $P = Q$.
However, KL is not a metric: there is no symmetry, i.e. in general $\mathrm{KL}(P\|Q) \ne \mathrm{KL}(Q\|P)$, and the triangle inequality does not hold.
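As a quick numerical illustration (not part of the original slides), here is a small Python/NumPy sketch of the discrete analogue of the definition above; the distributions p and q are arbitrary examples.

```python
import numpy as np

def kl(p, q):
    """Discrete analogue of KL(P||Q): sum_x p(x) * log(p(x) / q(x)).

    Assumes p and q are strictly positive and each sums to 1.
    """
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.4, 0.5])   # arbitrary example distribution
q = np.array([0.8, 0.1, 0.1])   # arbitrary reference distribution

print(kl(p, q))   # non-negative
print(kl(q, p))   # generally different from kl(p, q): no symmetry
print(kl(p, p))   # 0.0, since KL(P||Q) = 0 iff P = Q
```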



Jensen-Shannon Divergence

The Jensen-Shannon (JS) divergence is defined as
$$\mathrm{JS}(P\|Q) = \frac{1}{2}\,\mathrm{KL}(P\|M) + \frac{1}{2}\,\mathrm{KL}(Q\|M),$$
where $M = \dfrac{P+Q}{2}$. Unlike the KL divergence, the JS divergence is symmetric in P and Q.
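Continuing the same NumPy sketch (assumed, not from the slides), the JS divergence can be built from the KL helper; the symmetry is easy to check numerically.

```python
import numpy as np

def kl(p, q):
    # Discrete KL(P||Q); assumes strictly positive distributions that sum to 1.
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    # JS(P||Q) = 0.5 * KL(P||M) + 0.5 * KL(Q||M), with the mixture M = (P + Q) / 2.
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.1, 0.1])
print(js(p, q), js(q, p))   # equal: JS is symmetric
```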



KL Divergence for Gaussians

Recall that the probability density function (when it exists) of a multivariate normal distribution with mean $\mu$ and (non-singular, symmetric, positive definite) covariance matrix $\Sigma$ is
$$f(x) = \frac{\exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\}}{\sqrt{(2\pi)^{k}\,|\Sigma|}}, \qquad x \in \mathbb{R}^{k}.$$

Suppose that we have two multivariate normal distributions $N_1(\mu_1,\Sigma_1)$ and $N_2(\mu_2,\Sigma_2)$. Then
$$\mathrm{KL}(N_1\|N_2) = \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^{T}\Sigma_2^{-1}(\mu_2-\mu_1) - k + \ln\frac{|\Sigma_2|}{|\Sigma_1|}\right).$$

In the one-dimensional case this reduces to
$$\mathrm{KL}(N_1\|N_2) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^{2} + (\mu_1-\mu_2)^{2}}{2\sigma_2^{2}} - \frac{1}{2}.$$
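A small NumPy sketch (assumed, not from the slides) of the closed-form expressions above; the parameter values are arbitrary, and the 1×1 case is used to check that the one-dimensional formula agrees with the matrix formula.

```python
import numpy as np

def kl_mvn(mu1, sigma1, mu2, sigma2):
    """KL(N1 || N2) for multivariate normals N1(mu1, Sigma1), N2(mu2, Sigma2)."""
    k = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff
                  - k
                  + np.log(np.linalg.det(sigma2) / np.linalg.det(sigma1)))

def kl_1d(mu1, s1, mu2, s2):
    """One-dimensional case: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

# Both formulas agree when the covariances are 1x1 matrices.
print(kl_mvn(np.array([0.0]), np.array([[1.0]]), np.array([1.0]), np.array([[4.0]])))
print(kl_1d(0.0, 1.0, 1.0, 2.0))
```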





What is Unsupervised Learning?

Definition 1
Unsupervised learning is a machine learning technique that finds and
analyzes hidden patterns in unlabeled data.

Examples?



Autoencoders

Autoencoders are neural networks that aim to copy their inputs to their outputs. They work by compressing the input into a latent-space representation and then reconstructing the output from this representation. This kind of network is composed of two parts:
Encoder: the part of the network that compresses the input into a latent-space representation.
Decoder: the part that aims to reconstruct the input from the latent-space representation.
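To make the encoder/decoder split concrete, here is a minimal PyTorch sketch (assumed, not from the slides); the input size of 784 (e.g. flattened 28×28 images), the hidden width, and the latent dimension are illustrative choices.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into a latent-space representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Reconstruction loss on a dummy batch (illustrative only).
model = Autoencoder()
x = torch.rand(16, 784)
loss = nn.functional.mse_loss(model(x), x)
loss.backward()
```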





What are autoencoders used for?

Anomaly/Outlier detection.
Dimensionality reduction.
Data denoising.
Many other tasks.



Types of Autoencoders

Vanilla Autoencoders
Multilayer/Deep Autoencoders
Convolutional Autoencoders
Contractive Autoencoders
Regularized Autoencoders
Sparse Autoencoders
Denoising Autoencoders
Variational Autoencoders



Vanilla Autoencoders



Multilayer/Deep Autoencoders



Convolutional Autoencoders



Contractive Autoencoders

A contractive autoencoder makes the learned encoding less sensitive to small variations in its training dataset.
This is accomplished by adding a regularizer, or penalty term, to the cost or objective function that the algorithm is trying to minimize.
The end result is to reduce the learned representation's sensitivity towards the training input.



Let f be our encoder, g be the decoder, and D be our training dataset. In the previous cases we minimized a loss function of the form
$$\sum_{x\in D} L\bigl(x, g(f(x))\bigr).$$

In the case of contractive autoencoders we minimize instead
$$\sum_{x\in D}\left( L\bigl(x, g(f(x))\bigr) + \lambda\,\|J_f(x)\|_F^2\right),$$
where the added term is the squared Frobenius norm of the Jacobian matrix of the encoder,
$$[J_f(x)]_{i,j} = \frac{\partial f_j(x)}{\partial x_i}, \qquad \|J_f(x)\|_F^2 = \sum_{i,j}\left(\frac{\partial f_j(x)}{\partial x_i}\right)^2.$$
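A PyTorch sketch (assumed, not from the slides) of this loss: the encoder Jacobian is computed with torch.autograd.functional.jacobian for clarity, although in practice the penalty is often written out analytically for a single sigmoid layer. The layer sizes and the weight λ are illustrative.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

encoder = nn.Sequential(nn.Linear(20, 5), nn.Sigmoid())   # f : R^20 -> R^5
decoder = nn.Sequential(nn.Linear(5, 20))                 # g : R^5 -> R^20
lam = 1e-3   # weight of the contractive penalty (illustrative)

def contractive_loss(x):
    # Reconstruction term L(x, g(f(x))), here mean squared error.
    recon = nn.functional.mse_loss(decoder(encoder(x)), x)
    # Penalty: squared Frobenius norm of the encoder Jacobian, per sample.
    # (The Frobenius norm is unchanged by transposition, so the index order
    # does not matter here.)
    penalty = 0.0
    for xi in x:   # simple but slow; analytic expressions are used in practice
        J = jacobian(encoder, xi, create_graph=True)   # shape (5, 20)
        penalty = penalty + (J ** 2).sum()
    return recon + lam * penalty / x.shape[0]

x = torch.rand(8, 20)
contractive_loss(x).backward()
```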





Sparse Autoencoders

Sparse autoencoders have more hidden nodes than input nodes.
They can still discover important features in the data.
A sparsity penalty is introduced on the hidden layer to prevent the output layer from simply copying the input data; this helps prevent overfitting.



Let f be our encoder, g be the decoder, and D be our training dataset with n samples. Denote
$$\rho_j = \frac{1}{n}\sum_{x\in D} f_j(x).$$

We would like to (approximately) enforce the constraint $\rho_j = \rho$, where $\rho$ is a sparsity parameter, typically a small value close to zero (say $\rho = 0.05$).
To achieve this we minimize the following loss function:
$$\sum_{x\in D} L\bigl(x, g(f(x))\bigr) + \lambda\sum_j \mathrm{KL}(\rho\|\rho_j),$$
where
$$\mathrm{KL}(\rho\|\rho_j) = -\rho\log\frac{\rho_j}{\rho} - (1-\rho)\log\frac{1-\rho_j}{1-\rho},$$
i.e. the KL divergence between Bernoulli distributions with parameters $\rho$ and $\rho_j$.
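A PyTorch sketch (assumed, not from the slides) of this penalty: the batch-average activation of each hidden unit plays the role of $\rho_j$ and is pushed toward the target $\rho$ by the Bernoulli KL term. The layer sizes, ρ = 0.05 and λ are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 64), nn.Sigmoid())   # more hidden units than inputs
decoder = nn.Sequential(nn.Linear(64, 20))
rho, lam = 0.05, 1e-3   # sparsity target and penalty weight (illustrative)

def sparse_loss(x):
    h = encoder(x)                  # activations f(x), in (0, 1) thanks to the sigmoid
    recon = nn.functional.mse_loss(decoder(h), x)
    rho_hat = h.mean(dim=0)         # rho_j: average activation of unit j over the batch
    # KL(rho || rho_j) summed over hidden units j (same quantity as on the slide,
    # written as rho*log(rho/rho_j) + (1-rho)*log((1-rho)/(1-rho_j))).
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return recon + lam * kl

x = torch.rand(8, 20)
sparse_loss(x).backward()
```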





Denoising Autoencoders

Denoising autoencoders create a corrupted copy of the input by introducing some noise.
This prevents the autoencoder from simply copying the input to the output without learning features of the data.
The model learns a vector field that maps the input data towards a lower-dimensional manifold describing the natural data, cancelling out the added noise.
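A minimal PyTorch sketch (assumed, not from the slides): Gaussian noise corrupts the input, and the reconstruction loss is taken against the clean input, so the network has to undo the corruption. The noise level and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(),   # encoder
    nn.Linear(8, 20),              # decoder
)
noise_std = 0.3   # corruption level (illustrative)

x = torch.rand(16, 20)                          # clean inputs
x_noisy = x + noise_std * torch.randn_like(x)   # corrupted copy of the input
recon = autoencoder(x_noisy)
# The target is the clean input, so the model must learn to cancel the noise.
loss = nn.functional.mse_loss(recon, x)
loss.backward()
```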

