
Introduction To Autoencoders

A Brief Overview

Autoencoders are neural network-based models used for unsupervised learning. They discover underlying correlations among data and represent the data in a smaller dimension. Autoencoders frame an unsupervised learning problem as a supervised learning problem in order to train a neural network: the input itself is used as the target output. The input is squeezed down to a lower-dimensional encoded representation by an encoder network, and a decoder network then decodes the encoding to reconstruct the input.

The encoding produced by the encoder layer is a lower-dimensional representation of the data and captures several interesting, complex relationships among the data.

An Autoencoder has the following parts:

1. Encoder: The encoder is the part of the network that takes in the input and produces a lower-dimensional encoding.
2. Bottleneck: This is the lower-dimensional hidden layer where the encoding is produced. The bottleneck layer has a smaller number of nodes, and the number of nodes in the bottleneck layer also gives the dimension of the encoding of the input.
3. Decoder: The decoder takes in the encoding and reconstructs the input.
Image by author

The bottleneck layer is the lower-dimensional layer. In the diagram, we have the encoder and decoder neural networks; phi (φ) and theta (θ) denote the parameters of the encoder and decoder respectively.

The goal of the model is for the reconstructed output to match the input. To achieve this, we minimize a loss function called the reconstruction loss. The reconstruction loss is simply the error between the input and the reconstructed output, usually measured as the mean squared error or the binary cross-entropy between the two. Binary cross-entropy is used when the data is binary.
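For an input x and reconstruction x̂ with n components, the two are:

Mean squared error: L(x, x̂) = (1/n) Σ_i (x_i - x̂_i)²
Binary cross-entropy: L(x, x̂) = -(1/n) Σ_i [x_i log(x̂_i) + (1 - x_i) log(1 - x̂_i)]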

Now that we have a basic understanding of autoencoders, let's look at a basic tradeoff we need to keep in mind while designing one. Remember that the reason for using an autoencoder is that we want to capture and represent only the deep correlations and relationships among the data: we need a generalized lower-dimensional representation. That is why, if the features of the data are not correlated at all, it is hard for an autoencoder to represent the data in a lower dimension. If we design the network with a very large number of nodes in the bottleneck layer, it will create a large-dimensional encoding. The problem is that the network might then cheat and overfit to the input data by simply memorizing it, and we will not get meaningful relationships in our encodings. On the other hand, if we use a shallow network with very few nodes, it will be very hard to capture all the relationships. So, we must be careful when designing the network.

Now, a question may arise: why go for an autoencoder when we have methods like PCA for dimensionality reduction?

Well, here goes the explanation. PCA, or principal component analysis, tries to find lower-dimensional orthogonal hyperplanes that describe the original data by capturing the maximum possible variance in the data, and consequently the important correlations. Note that we are talking about finding a hyperplane, so the method is linear. But correlations are often non-linear, and those are not captured by PCA.
Source

As we can see in the above diagram, autoencoders capture non-linear data dependencies, and are thus better suited than PCA for dimensionality reduction when such dependencies exist.

Let’s look at some of the applications of autoencoders:

1. Autoencoders are used largely for anomaly detection: As we know, autoencoders create encodings that capture the relationships among the data. If we train our autoencoder on a particular dataset, the encoder and decoder parameters will be trained to represent those relationships as well as possible, and the model will be able to reconstruct data of that kind accurately. So, if data from that dataset is sent through the autoencoder, the reconstruction error is small. But if some other kind of data is sent through, it will generate a large reconstruction error. If we can choose a correct cutoff on this error, we can build an anomaly detector (a small sketch of this thresholding idea follows this list).
2. Autoencoders are used for noise removal: If we pass noisy data as input and clean data as target and train an autoencoder on such pairs, the trained autoencoder can be highly useful for noise removal. This is because noise usually carries no meaningful correlations. Since the autoencoder has to represent the data in a low dimension, the encodings keep only the important relationships and reject the random ones. So, the decoded output of the autoencoder is free of these extra relationships, and hence of the noise.
3. Autoencoders as generative models: Before GANs came into existence, autoencoders were used as generative models. One modified form of autoencoder, the variational autoencoder, is used for generative purposes.
4. Autoencoders for collaborative filtering: Collaborative filtering normally uses matrix factorization methods, but autoencoders can learn the dependencies and learn to predict the item-user matrix.
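A minimal sketch of the thresholding idea from point 1, assuming a trained Keras autoencoder and a flattened test set x_test like the ones built later in this article:

import numpy as np

## Reconstruction error per sample: mean squared error between input and output
reconstructions = autoencoder.predict(x_test)
errors = np.mean(np.square(x_test - reconstructions), axis=1)

## Pick a cutoff from the error distribution of normal data, e.g. the 99th
## percentile, and flag anything above it as an anomaly
threshold = np.percentile(errors, 99)
anomalies = errors > threshold
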
Types of Autoencoders

Several kinds of autoencoders have been developed to address these different tradeoffs. Let's look at some of them.

Undercomplete Autoencoder

The undercomplete autoencoder is the simplest autoencoder architecture. It relies on constraining the number of nodes in the hidden layers and the central bottleneck, which restricts the flow of information through the network. The idea is that if less information can flow through and the network still has to learn the best possible encoding, it will keep only the most important dependencies and reject the rest. This way we get an encoding that allows the best reconstruction.

The loss function used is the normal reconstruction loss, i.e., MSE or binary cross-entropy. Since we restrict the flow of information with the bottleneck, there is little chance that the model memorizes the input and cheats.
Image by Author

The above diagram shows an undercomplete autoencoder. We can see that the hidden layers have a smaller number of nodes.

Let's see how to create an undercomplete autoencoder with TensorFlow.

Simple undercomplete autoencoder:


import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input

(x_train, _), (x_test, _) = mnist.load_data()

x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
## Shape of x_train is (60000, 784)
## Shape of x_test is (10000, 784)

input_l = Input(shape=(784,))
bottleneck = Dense(32, activation='relu')(input_l)
output_l = Dense(784, activation='sigmoid')(bottleneck)

autoencoder = Model(inputs=[input_l], outputs=[output_l])   ## Building the entire autoencoder
encoder = Model(inputs=[input_l], outputs=[bottleneck])     ## Building the encoder

encoded_input = Input(shape=(32,))
decoded = autoencoder.layers[-1](encoded_input)
decoder = Model(inputs=[encoded_input], outputs=[decoded])  ## Building the decoder

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

The models created by the above code are:


Image by author

The first model is the decoder, the second is the full autoencoder and
the third is the encoder model. The bottleneck layer is the place where
the encoded image is generated.

We train the full autoencoder, and its trained weights are shared with the encoder and decoder models.

If we send image encodings through the decoder, we will see that the images are reconstructed.

Image by author

The upper row is the original images and the lower row is the images
created from the encodings by the decoder.
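The reconstructions above can be produced from the trained models with something like:

encoded_imgs = encoder.predict(x_test)        ## shape (10000, 32): the encodings
decoded_imgs = decoder.predict(encoded_imgs)  ## shape (10000, 784): the reconstructions
## Reshape each row of decoded_imgs to 28x28 to display it as an image
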
Now, the images have dimensions 28x28, and we have created encodings of dimension 32. If we represent the encodings as 16x2, they will look something like this:

Image by author

The lower row represents the corresponding encodings.

So far we have built a very shallow network. We can instead build a deep network, since shallow networks may not be able to uncover all the underlying features, but we still need to be careful about restricting the number of hidden nodes.
input_l = Input(shape=(784,))
encoding_1 = Dense(256, activation='relu')(input_l)
encoding_2 = Dense(128, activation='relu')(encoding_1)
bottleneck = Dense(32, activation='relu')(encoding_2)
decoding_1 = Dense(128, activation='relu')(bottleneck)
decoding_2 = Dense(256, activation='relu')(decoding_1)
output_l = Dense(784, activation='sigmoid')(decoding_2)

autoencoder = Model(inputs=[input_l], outputs=[output_l])
encoder = Model(inputs=[input_l], outputs=[bottleneck])

encoded_input = Input(shape=(32,))
decoded_layer_1 = autoencoder.layers[-3](encoded_input)
decoded_layer_2 = autoencoder.layers[-2](decoded_layer_1)
decoded = autoencoder.layers[-1](decoded_layer_2)
decoder = Model(inputs=[encoded_input], outputs=[decoded])

The above implements a deep undercomplete autoencoder.

Sparse Autoencoders
When we talked about undercomplete autoencoders, we said that we restrict the number of nodes in the hidden layer to restrict the data flow. But this approach often creates issues, because the limited hidden-layer nodes and shallower networks prevent the neural network from uncovering complex relationships among the data items. So, we need deeper networks with more hidden-layer nodes. But with more hidden-layer nodes, the network may simply memorize the input and overfit, defeating our purpose. To solve this, we use regularizers, which prevent the network from overfitting to the input data and avoid the memorization problem.

In regularization we normally penalize the weights, but in this case we penalize the activations that are passed from one hidden layer to the next. In simpler words, the idea is that we won't let all the nodes in the hidden layers be active. Going back to the basics of neural networks: an activation controls how much information a particular node passes on, working like a gate. If the activation of a node is 0, the node contributes no information. The idea behind sparse autoencoders is similar.

One thing to note is that the activations depend on the input data and will change as the input changes. So, we let the model decide the activations and penalize their values. We usually do this in one of two ways:
L1 Regularization: An L1 regularizer restricts the activations as discussed above. It forces the network to use only those hidden-layer nodes that carry a high amount of information and to block the rest.

It is given by:
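Loss = L(x, x̂) + λ Σ_i |a_i^(h)|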

The reconstruction loss is given by L, and the second part is the regularizer that penalizes the activations. As we can see, the regularizer is a summation over the activations of all nodes in the hidden layer h, so minimizing the loss function also decreases the activations. The tuning parameter lambda controls how much weight we give to the regularization term.

KL Divergence: The Kullback-Leibler divergence measures how different (or similar) two probability distributions are. It is given by:
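D_KL(p || q) = Σ_x p(x) log(p(x) / q(x))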
So, basically, it tells us how similar p and q are. This method uses a sparsity parameter ρ (rho), which is the desired average activation of a neuron over a set of samples. The idea is to use a very low value of rho, so that to keep its average activation low, a node has to output an activation of 0 on the samples where it is not essential.

Now, the question is: how does the KL divergence help? For this, we need to know what a Bernoulli distribution is.

In probability theory and statistics, the Bernoulli distribution, is


the discrete probability distribution of a random variable which
takes the value 1 with probability p and the value 0 with
probability q=1-p

So, it is basically a binary probability distribution, and we want something similar for our nodes: we want each node to fire with a certain probability, so that its behaviour resembles a Bernoulli distribution. For a particular neuron j we can calculate rho as:
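ρ̂_j = (1/m) Σ_{i=1..m} a_j^(h)(x_i)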

where m is the number of observations and a_j^(h)(x_i) is the activation of neuron j in the hidden layer h for observation x_i. The loss is given by:
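Loss = L(x, x̂) + Σ_j KL(ρ || ρ̂_j)

where the sum runs over the nodes j of the hidden layer.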
A visualization will look like this:

Image by author

The above image shows that the light-red nodes do not fire.

Let's see how to create a sparse autoencoder with TensorFlow.
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.regularizers import l1

(x_train, _), (x_test, _) = mnist.load_data()

x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
## Shape of x_train is (60000, 784)
## Shape of x_test is (10000, 784)

input_l = Input(shape=(784,))
encoding_1 = Dense(256, activation='relu', activity_regularizer=l1(0.001))(input_l)
bottleneck = Dense(32, activation='relu', activity_regularizer=l1(0.001))(encoding_1)
decoding_1 = Dense(256, activation='relu', activity_regularizer=l1(0.001))(bottleneck)
output_l = Dense(784, activation='sigmoid')(decoding_1)

autoencoder = Model(inputs=[input_l], outputs=[output_l])
encoder = Model(inputs=[input_l], outputs=[bottleneck])

encoded_input = Input(shape=(32,))
decoded_layer_2 = autoencoder.layers[-2](encoded_input)
decoded = autoencoder.layers[-1](decoded_layer_2)
decoder = Model(inputs=[encoded_input], outputs=[decoded])

autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

The above code uses an L1 regularizer.


Image by author
The images represent the full autoencoder, followed by the encoder
and the decoder.

Denoising Autoencoders

We have already talked about autoencoders being used as noise removers. The idea is that, in order to capture the underlying relationships in a small encoding, the autoencoder focuses on the object in the image and not on the noise, which gets eliminated. Here we do not need to restrict the number of nodes or use a regularizer, because the input and output are different and the memorization problem no longer exists.

Image by author

The above network represents denoising autoencoders.


from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D

input_l = Input(shape=(28, 28, 1))

encoding_1 = Conv2D(32, (3, 3), activation='relu', padding='same')(input_l)
maxp_1 = MaxPooling2D((2, 2), padding='same')(encoding_1)
encoding_2 = Conv2D(16, (3, 3), activation='relu', padding='same')(maxp_1)
maxp_2 = MaxPooling2D((2, 2), padding='same')(encoding_2)
encoding_3 = Conv2D(8, (3, 3), activation='relu', padding='same')(maxp_2)
bottleneck = MaxPooling2D((2, 2), padding='same')(encoding_3)

decoding_1 = Conv2D(8, (3, 3), activation='relu', padding='same')(bottleneck)
Up_1 = UpSampling2D((2, 2))(decoding_1)
decoding_2 = Conv2D(16, (3, 3), activation='relu', padding='same')(Up_1)
Up_2 = UpSampling2D((2, 2))(decoding_2)
decoding_3 = Conv2D(32, (3, 3), activation='relu')(Up_2)  ## 'valid' padding: 16x16 -> 14x14
Up_3 = UpSampling2D((2, 2))(decoding_3)
output_l = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(Up_3)

autoencoder = Model(inputs=[input_l], outputs=[output_l])
encoder = Model(inputs=[input_l], outputs=[bottleneck])
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

The above code can be used to create the autoencoder.
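The data preparation and training step for the denoising setup is not shown above; a minimal sketch, assuming the MNIST-style images are already scaled to [0, 1] (the noise_factor value is illustrative):

import numpy as np

## Reshape images to (n, 28, 28, 1) and create noisy copies to use as input
x_train = x_train.reshape((len(x_train), 28, 28, 1))
x_test = x_test.reshape((len(x_test), 28, 28, 1))

noise_factor = 0.5
x_train_noisy = np.clip(x_train + noise_factor * np.random.normal(size=x_train.shape), 0., 1.)
x_test_noisy = np.clip(x_test + noise_factor * np.random.normal(size=x_test.shape), 0., 1.)

## Noisy images as input, clean images as target
autoencoder.fit(x_train_noisy, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test_noisy, x_test))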

The results on datasets are as follows:

Image by author

The above are the results on the Fashion-MNIST dataset.

As you can see, I have used a convolutional network to create the


autoencoder.

The model structure is as below:


Image by author

Contractive Autoencoders

The principle behind contractive autoencoders is quite similar to that of denoising autoencoders: the encodings produced for similar inputs should be similar. In other words, if we tweak the inputs just a little, the encodings should barely change. This is enforced by adding a penalty on how sensitive the encoder's activations are to the input. Contractive autoencoders are used for feature extraction.
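A minimal sketch of that penalty, assuming an encoder/autoencoder pair built like the dense models earlier in this article; the function name contractive_loss and the weight lam are illustrative, not part of any standard API:

import tensorflow as tf

lam = 1e-4  ## weight of the contractive penalty (illustrative value)

def contractive_loss(x):
    ## Reconstruction term plus the squared Frobenius norm of the Jacobian
    ## of the bottleneck activations with respect to the input
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                       ## bottleneck activations, shape (batch, 32)
    jacobian = tape.batch_jacobian(h, x)     ## shape (batch, 32, 784)
    recon = autoencoder(x)
    recon_loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(x, recon))
    penalty = tf.reduce_mean(tf.reduce_sum(tf.square(jacobian), axis=[1, 2]))
    return recon_loss + lam * penalty

Such a loss would be minimized with a custom training loop rather than the plain compile/fit calls used above.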

Autoencoders can be implemented with any kind of neural network; for example, for image data we can use convolutional neural networks, and for time-series data we can use recurrent neural networks.

There exists another type of autoencoder, a bit different from those above, called the variational autoencoder.

Variational Autoencoders

To understand the concept we need to go back to basics. What do we actually mean by a lower-dimensional encoding? As we saw above, our input had dimension 784x1 (or 28x28); when we encode it to a much smaller dimension, say 32x1, we mean that we now have 32 features which are the most important ones and reflect most of the information in the data, or image.
For example, take a face image of dimension 32x32, which holds the full two-dimensional facial image. If we encode it to dimension 6x1, i.e., send it through a bottleneck layer of 6 nodes, we basically get the 6 features that contribute the most information about the facial image. Say the 6 features are smile, skin tone, gender, beard, wears glasses, and hair color. Our encoding then has a numerical value for each of these features for a particular facial image, and by sending the encoding through a decoder we can reconstruct the image.

The 6 features in this lower-dimensional encoding are called latent features/attributes, and the set of values a feature can take is its latent space.

Source

Now, different values of the latent attributes represent different images


as the feature varies as shown below.
Source

So far, each latent attribute has taken a single fixed value. This is where variational autoencoders are different: instead of passing single values, they represent each latent attribute as a probability distribution, something like the one shown below.
Source

So, our face representation example becomes as follows:

Source

Now each latent attribute is represented as a probability distribution. The decoder samples from each latent distribution and decodes the samples to reconstruct the image. Because the sampling is random, the reconstructed image is similar to the input but is not actually present in the input set. This is why variational autoencoders are known as generative networks.

Source

The above image depicts the situation. To create a distribution for each latent attribute, the encoder, instead of passing a single value, passes the mean and standard deviation of the distribution, which are used to construct a normal distribution.
Source

The above image shows the structure of a variational autoencoder. The probabilistic encoder is called the recognition model and the decoder is called the generative model.

Now, since z, the latent values, are sampled randomly, they are unknown and hence called hidden variables. We know that the goal is for the reconstructed output to match the input, so we want to find the probability distribution of the latent vector z given the observed data x, i.e., p(z|x), because we actually need to reconstruct x from z. In simpler words, we can see x but we need to estimate z.
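p(z|x) = p(x|z) p(z) / p(x)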
We obtain the above equation using Bayes' theorem. This requires finding p(x), given by:
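p(x) = ∫ p(x|z) p(z) dz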

This problem is intractable: it involves a multiple integral, and the number of integrals grows with the number of latent attributes, i.e., the encoding dimension.

To solve this problem, we use another distribution q(z|x), an approximation of p(z|x) that is designed to be tractable. To make sure q(z|x) stays close to p(z|x), we use the KL divergence between the two distributions.

The variational autoencoder uses a loss function of the form:
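Loss = L(x, x̂) + Σ_j KL(q_j(z|x) || p(z))

Here p(z) is the prior over the latent variables, usually taken to be a standard normal distribution.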

The first term is the reconstruction error and the second term is the KL divergence, which keeps the learned latent distributions close to the prior; minimizing the loss therefore also minimizes this divergence. We use a trick called the reparameterization trick to sample latent points in a way that still allows backpropagation through the sampling step.
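Concretely, the trick samples each latent value as z = μ + σ ⊙ ε with ε drawn from a standard normal distribution, so the randomness is pushed into ε and gradients can flow through μ and σ.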
