
Variational Autoencoder

Introduction:
An autoencoder is a deep learning model made of two neural networks: an encoder and a decoder. The encoder projects the data from the input space to a "latent space". The latent space is generally a lower-dimensional space, so the points in latent space corresponding to the inputs are "representations" of them and contain the useful information of the input. The decoder projects these representations into an output space that has the same dimensions as the input space, essentially reconstructing the inputs from their representations.

A Variational Autoencoder (VAE) functions like an autoencoder, but the representations in the latent space are modeled by probability distributions rather than single points. The advantage of using a probability distribution rather than a single point is that we can sample points from this distribution and pass them through the decoder neural network to get meaningful outputs.
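To make the encoder/decoder idea concrete, here is a minimal, hedged sketch of a plain autoencoder in PyTorch. The layer sizes, names, and the flattened 784-dimensional input are illustrative assumptions, not something specified by the text above.

```python
# A minimal sketch of a plain autoencoder, assuming image inputs flattened
# to 784 values (e.g. 28x28 pixels) and a 16-dimensional latent space; the
# layer sizes and names here are illustrative choices, not a fixed recipe.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: projects the input into the lower-dimensional latent space.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: projects the latent representation back to the input space.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # representation in latent space
        return self.decoder(z)   # reconstruction in input space
```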

A brief review of probability distributions:


To understand the mathematical formulation of the Variational Autoencoder and how the probabilistic approach to the latent space works, some understanding of probability distributions is needed.

A probability distribution p(.) is described by a function (called the probability density function, or p.d.f.) defined over ℝ^d (d is the dimension of the space); for a sample "x" from that distribution, p(x) gives the likelihood of the sample taking the value x.

Any probability distribution can be described by its "mean" (µ), the center point of the distribution, and its "variance" (σ²), the spread of the distribution around the center.

A probability distribution can also take discrete-valued inputs. In this case it is defined over ℕ^d and the function is called the probability mass function (p.m.f.).
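As a small illustration of these ideas, the sketch below evaluates the p.d.f. of a one-dimensional Gaussian and checks its mean and variance empirically; the use of scipy.stats and the specific values are assumptions made only for this example.

```python
# A small numerical sketch of a probability density function, assuming a
# one-dimensional Gaussian with mean 0 and variance 1.
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                  # mean and standard deviation (variance = sigma**2)
p = norm(loc=mu, scale=sigma)         # frozen distribution object

x = np.array([-2.0, 0.0, 2.0])
print(p.pdf(x))                       # density values p(x); largest at the mean
print(p.mean(), p.var())              # mean and variance of the distribution

samples = p.rvs(size=10_000, random_state=0)
print(samples.mean(), samples.var())  # empirical mean/variance approach 0 and 1
```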

Construction of Variational Autoencoder:


The encoder of a VAE is a neural network with two outputs: the mean and variance of the distribution for a given input. From the distribution described by this mean and variance, a "sampler" generates a point. This point has the properties of the distribution from which it is sampled, and since that distribution is a representation of the input, the point carries useful information about the input. The decoder takes this point as its input and uses that information to reconstruct the input. By enforcing that the reconstruction should be similar to the input, the encoder learns to make the distributions representations of the inputs, so that samples from these distributions carry information about the input. A minimal sketch of these pieces is given below.
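The sketch below illustrates the pieces just described, under assumed layer sizes and a 784-dimensional flattened input: an encoder with two output heads (mean and log-variance), a sampler, and a decoder. The sampler uses the standard reparameterization trick (z = μ + σ·ε with ε ~ N(0, I)), which the text does not discuss explicitly but which is how the sampling step is usually made differentiable in practice.

```python
# A minimal sketch of a VAE: encoder with mean/log-variance heads, a sampler
# using the reparameterization trick, and a decoder. Sizes are illustrative.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.mu_head = nn.Linear(128, latent_dim)      # mean of q(z|x)
        self.logvar_head = nn.Linear(128, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)

    def sample(self, mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.sample(mu, logvar)            # point drawn from the latent distribution
        return self.decoder(z), mu, logvar     # reconstruction plus distribution parameters
```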
Common terms and conventions used in mathematical expressions:

x - input
y - output (obtained from decoder)
z - latent variable
p(.) - probability distribution function for (.)
log(.) - logarithm of (.)
KL(.) - KL divergence of (.) (the definition of KL divergence is
provided in the next section)
∑ - Summation (It is assumed throughout the derivation that
summation is replaced by integration if the distributions are
continuous)
max(.), min(.) - maximum and minimum values of (.)
N(μ, σ²) - Normal distribution with mean μ and variance σ²
E(.) - Expectation of (.)
p, q – p(z|x) and q(z|x) respectively.

Some Definitions and Basic Principles used in sections below:

• KL Divergence: $KL(q \parallel p) = -\sum q \cdot \log\!\left(\dfrac{p}{q}\right)$


o Where "q" and "p" are probability distributions; the expression is read as the KL divergence between q and p.
o KL divergence measures how similar the two distributions are.
o The smaller the KL divergence between q and p, the more similar the two distributions are. It behaves like a distance function for probability distributions, but it does not satisfy all the properties of a distance metric (for example, it is not symmetric). A small numerical check of this definition is given after this list.

• Bayes Rule: $p(a \mid b) = \dfrac{p(b \mid a) \cdot p(a)}{p(b)}$

o By the definition of conditional probability, $p(a \mid b) = \dfrac{p(a \cap b)}{p(b)}$, so $p(a \cap b) = p(a \mid b) \cdot p(b)$.
o Similarly, $p(b \mid a) = \dfrac{p(a \cap b)}{p(a)}$, so $p(a \cap b) = p(b \mid a) \cdot p(a)$.
o Hence $p(a \mid b) \cdot p(b) = p(b \mid a) \cdot p(a) = p(a \cap b)$.
o From this we obtain Bayes Rule.
• Marginalization: $p(a) = \sum_{b} p(a \cap b) = \sum_{b} p(a \mid b) \cdot p(b)$
• The function of a sampler is to randomly sample points from a distribution. The collective set of these points is called a random sample, and random sampling means that the sampler has no bias towards any particular set of values from the event space.
• Spatial Information: The spatial information of data is the arrangement of its coordinates that we perceive and interpret.
o For example, pixels are the coordinates of an image, and they are arranged in a particular way that we perceive. If the pixels are permuted, the spatial information is lost.
o Similarly, audio data has a certain arrangement of phonemes, and that arrangement carries the information we need to perceive the audio signal.
o The distributions in the latent space of a variational autoencoder should capture the spatial information of the inputs.
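As referenced in the KL divergence definition above, here is a small numerical check of that formula for two discrete distributions; the particular probability vectors are arbitrary examples.

```python
# A small numerical check of the definition above,
# KL(q || p) = -sum_i q_i * log(p_i / q_i), for two discrete distributions.
import numpy as np

q = np.array([0.1, 0.4, 0.5])
p = np.array([0.2, 0.3, 0.5])

kl_qp = -np.sum(q * np.log(p / q))
kl_pq = -np.sum(p * np.log(q / p))

print(kl_qp)   # small positive number: q and p are fairly similar
print(kl_pq)   # generally different from kl_qp: KL is not symmetric,
               # which is one reason it is not a true distance metric
print(-np.sum(q * np.log(q / q)))  # KL(q || q) = 0 for identical distributions
```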

Mathematical Formulation of VAE:


The mathematical formulation of the VAE depends on a few important probability distributions:
• p(z|x) : The true posterior over the latent space for a given input x. It models how the latent representations of a given input are distributed, but it cannot be computed directly.
• q(z|x) : The distribution that approximates the intractable p(z|x). When x is passed through the encoder, the mean and variance of this distribution are obtained.
• p(z) : The prior distribution. It acts like a hyper-parameter and is fixed throughout training and inference.
• p(x|z) : The likelihood of the data given a latent point. This is the distribution used when Bayes Rule is applied to derive the formulation.
Here, the decoder of the model is completely deterministic, i.e., for a given z it produces the same y once it has been fully trained. So p(x|z) is the same as p(x|y).
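For concreteness, these distributions are usually given parametric forms such as the ones below; these specific choices (a diagonal-Gaussian encoder output, a standard-normal prior, and a Gaussian decoder likelihood) are assumptions consistent with the Gaussian prior discussed later, not the only possibility.

```latex
\begin{align*}
  q(z \mid x) &= \mathcal{N}\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma^2_\phi(x))\big)
      && \text{mean and variance produced by the encoder} \\
  p(z)        &= \mathcal{N}(z;\ 0,\ I)
      && \text{fixed prior} \\
  p(x \mid z) &= \mathcal{N}\big(x;\ y,\ \sigma^2 I\big), \quad y = \mathrm{decoder}(z)
      && \text{decoder output $y$ as the mean of a Gaussian}
\end{align*}
```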

We need to formulate the distribution that spreads across the latent space, i.e., p(z|x). Using Bayes Rule, we can write the following:

$$p(z \mid x) = \frac{p(x \mid z) \cdot p(z)}{p(x)}$$

and

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

p(x) is an intractable distribution: this integral does not have a closed-form solution. Since z ranges over all real values, the integration is not computable, i.e., it is intractable. So a distribution q(z|x) is assumed to exist that is very close to p(z|x). The encoder neural network outputs the mean and variance of q(z|x), and the weights of this neural network need to be optimized so that random samples from q(z|x) contain the same information as samples from p(z|x).
Aim: To minimize the KL divergence between the distributions q(z|x)
and p(z|x).
$$KL(q \parallel p) = -\sum q(z \mid x) \cdot \log\!\left(\frac{p(z \mid x)}{q(z \mid x)}\right); \qquad p(z \mid x) = \frac{p(x \mid z) \cdot p(z)}{p(x)} \ \text{(from Bayes Rule)}$$

$$= -\sum q(z \mid x) \cdot \log\!\left(\frac{p(x \mid z)\, p(z)}{p(x)\, q(z \mid x)}\right)$$

$$= -\sum q(z \mid x) \left\{ \log\!\left(\frac{p(x \mid z)\, p(z)}{q(z \mid x)} \cdot \frac{1}{p(x)}\right) \right\}$$

$$= -\sum q(z \mid x) \left\{ \log\!\left(\frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right) - \log(p(x)) \right\}$$

$$= \sum q(z \mid x) \cdot \log(p(x)) \ - \ \sum q(z \mid x) \left\{ \log\!\left(\frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right) \right\} \qquad \ldots (1)$$

The last line of the equation contains two summation terms. The first can be written as:

$$\sum q(z \mid x) \cdot \log(p(x)) = \log p(x)$$

log(p(x)) is independent of z, and since the summation is over the latent space (z), the term can be written as $\log(p(x)) \sum q(z \mid x)$. Here $\sum q(z \mid x) = 1$ because q(z|x) is a probability distribution, and when all the events of a probability space are considered, the total probability equals 1.

So, the above equation can be written as,


$$KL(q \parallel p) = \log p(x) - L, \qquad \text{where } L = \sum q(z \mid x) \left\{ \log\!\left(\frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right) \right\}$$

Adding L to both sides of the equation gives

$$\log p(x) = KL(q \parallel p) + L$$

For a given input "x", the value of log p(x) is constant. This implies that the value of (KL(q ∥ p) + L) is constant. So minimizing KL(q ∥ p) is the same as maximizing the value of L. Hence the optimization problem can be written as:

$$\max\left(\sum q(z \mid x) \left\{ \log\!\left(\frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right) \right\}\right)$$

$$= \max\left(\sum q(z \mid x) \left\{ \log(p(x \mid z)) + \log\!\left(\frac{p(z)}{q(z \mid x)}\right) \right\}\right)$$

$$= \max\left(\sum q(z \mid x)\, \log(p(x \mid z)) \ + \ \sum q(z \mid x)\, \log\!\left(\frac{p(z)}{q(z \mid x)}\right)\right)$$

The terms of the above equation are the following:

$$\sum q(z \mid x)\, \log\!\left(\frac{p(z)}{q(z \mid x)}\right) = -KL(q(z \mid x) \parallel p(z)) \quad \text{and}$$

$$\sum q(z \mid x)\, \log(p(x \mid z)) = E(\log(p(x \mid z)))$$


Since p(x|z) can be written as p(x|y), log(p(x|z)) can be written as log(p(x|y)).

This can be viewed as a (negative) distance between the input "x" and the output "y", and E(.) is the average of the random variable log(p(x|y)). Hence the whole term E(log(p(x|z))) is equivalent to a "cost function" that measures how similar the decoder's reconstruction is to the input. Maximizing E(log(p(x|z))) ensures that the variational autoencoder produces outputs that are reconstructions of the inputs and not just random pixels arranged together. To see why log(p(x|y)) behaves like a (negative) distance function, consider an example where p(x|y) is a Gaussian distribution (which it will be for most general purposes). Then we can write the following:
$$p(x \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-y)^2}{2\sigma^2}}$$

$$\log(p(x \mid y)) \sim -(x - y)^2 \quad \text{(the right-hand side must be scaled and shifted by constants for equality)}$$

We can see how this reduces to the (squared) Euclidean distance when a Gaussian distribution is considered. Similarly, if x and y are discrete (for example, binary pixel values), p(x|y) will be a Bernoulli distribution and the negative logarithm of p(x|y) becomes the cross-entropy loss function.
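The sketch below numerically checks this claim for the Gaussian case: the log-density log p(x|y) and the squared error (x − y)² differ only by fixed constants, so maximizing one is equivalent to minimizing the other. The specific values of x, y, and σ are arbitrary examples.

```python
# Check that, for a Gaussian p(x|y) with fixed variance, log p(x|y) equals
# -(x - y)^2 up to multiplicative and additive constants.
import numpy as np

sigma = 1.0
x = np.array([0.2, 0.7, 1.5])        # "inputs"
y = np.array([0.0, 1.0, 1.0])        # "reconstructions"

log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - y) ** 2 / (2 * sigma**2)
sq_err = (x - y) ** 2

print(log_p)
print(-0.5 / sigma**2 * sq_err - 0.5 * np.log(2 * np.pi * sigma**2))  # identical values
```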

In statistics, p(z) is called the prior distribution, because we fix it to a known distribution (such as the standard Gaussian, a Gaussian with mean µ = 0 and variance σ² = 1). The standard Gaussian is the popular choice, but the prior can be different if the input data is not an image. Choosing the prior distribution comes from experience and statistical knowledge of the data. The KL(q(z|x) ∥ p(z)) term regulates the spread of the data points projected into the latent space. Without this term, the distributions in the latent space would collapse to points (variance = 0) and the VAE would function like a normal autoencoder.

We can summarize the mathematical formulation of the optimization problem of the VAE with this equation:

$$\max\Big( E(\log p(x \mid z)) \ - \ KL(q(z \mid x) \parallel p(z)) \Big)$$
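A minimal sketch of this objective as a training loss follows, assuming the VAE sketch given earlier, a Gaussian p(x|z) (so the reconstruction term becomes a squared error) and the closed-form KL divergence between N(μ, σ²) and the standard-normal prior; in practice the negative of the objective is minimized.

```python
# A sketch of the VAE training objective as a loss, assuming a Gaussian
# p(x|z) and a standard-normal prior p(z). Minimizing this loss maximizes
# E(log p(x|z)) - KL(q(z|x) || p(z)).
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # -E[log p(x|z)] up to constants: squared-error reconstruction term
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL(q(z|x) || p(z)) for Gaussian q and standard-normal prior
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + kl
```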

Theoretical Intuition of the objective for Variational Autoencoder:


A normal autoencoder encodes the inputs as lower-dimensional points in a latent space (in some cases, the points in latent space can also be higher-dimensional than the input space). Hence we cannot sample points from the latent space, as it is not covered by any distribution. If we randomly pick a point from the latent space and pass it through the decoder, we get some random output that is dimensionally similar to the data the model was trained on but has no spatial information. Such a model is called discriminative, as it does not consider how the data seen during training is distributed or how it is generated from a distribution.

A generative model, by contrast (within the family of autoencoders, the variational autoencoder is generative), considers a distribution p(x) from which the training data is generated. This distribution is modeled in the latent space by q(z|x), which approximates the distribution p(z|x) that models exactly how the corresponding points in latent space are distributed for those training data points. Hence, when we randomly sample a point from the distribution q(z|x) and pass it through the decoder, we get a meaningful output (an output that is dimensionally similar to the training data and has similar spatial information).

The name "variational" comes from statistics. Variational inference is a statistical method for approximating intractable probability distributions with tractable ones. The main part of the VAE is how the distribution p(z|x) is approximated by the distribution q(z|x), together with the assumption of a "prior distribution" p(z) over the latent space.

Working of a trained Variational Autoencoder:


• The latent space of a Variational Autoencoder is filled with many distributions corresponding to the training data points. All of these are probability distributions similar to the prior distribution p(z). Since the prior is picked from a well-known family of distributions, their (normalized) mixture is also a probability distribution. Let's call this the "latent distribution".
• If we randomly sample some points from the latent distribution and pass them through the decoder, we get meaningful outputs that are not present in the training dataset our model has seen.
• The outputs obtained from these random samples are generally soft at the edges. This is because the distributions of individual training data points often overlap in the latent space.
• If the data our autoencoder models also has temporal information, then the prior distribution used in variational inference should be changed accordingly, and stochastic processes are a better fit. For example, a Gaussian distribution is suitable for image data; if video data needs to be encoded, then Brownian motion might be a suitable process for the latent space.
• The mathematical expression obtained above says that we need to solve the optimization problem max(E(log p(x|z)) − KL(q(z|x) ∥ p(z))). But we have not actually solved this optimization problem and obtained the weights of the encoder and decoder neural networks. Because complex neural networks are involved, it cannot be solved by traditional analytical methods. Instead, we use "gradient ascent" on the objective (in practice, gradient descent on its negative, which serves as the loss function). This involves initializing the weights of the networks and gradually changing them so that the value of the objective increases until a maximum is reached. Then the training process comes to a halt. A minimal training sketch is given after this list.
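The sketch below shows this gradient-based training loop and the generation step described earlier, assuming the VAE and vae_loss sketches above; random data stands in for a real dataset purely for illustration, and in practice batches would come from a data loader.

```python
# A minimal training-and-generation sketch, assuming the VAE and vae_loss
# sketches defined earlier in this text.
import torch

model = VAE(input_dim=784, latent_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.rand(64, 784)                    # stand-in batch of flattened inputs in [0, 1]
    x_recon, mu, logvar = model(x)
    loss = vae_loss(x, x_recon, mu, logvar)    # negative of the objective E - KL
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # each step nudges the objective upward

# Generation after training: sample z from the prior p(z) = N(0, I)
# and pass it through the decoder to get a new output.
with torch.no_grad():
    z = torch.randn(1, 16)
    new_sample = model.decoder(z)
```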
