Dissertation on Automatic Image Colourization using Generative Models

Submitted in partial fulfillment of the requirements for the award of the degree of


Bachelor of Technology in Computer Science & Engineering Submitted by:

Mudit Jha 01FB16ECS214
Saahil Jain 01FB16ECS321
Sayantan Nandy 01FB16ECS345

Under the guidance of
Internal Guide: Prof. Suresh Jamadagni, Associate Professor, PES University
External Guide: Name of the Guide, Designation, Company Name

January – May 2020

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
FACULTY OF ENGINEERING
PES UNIVERSITY
(Established under Karnataka Act No. 16 of 2013)
100ft Ring Road, Bengaluru – 560 085, Karnataka, India

PES UNIVERSITY
(Established under Karnataka Act No. 16 of 2013)
100ft Ring Road, Bengaluru – 560 085, Karnataka, India
FACULTY OF ENGINEERING

CERTIFICATE
This is to certify that the dissertation entitled

Automatic Image Colourization using Generative Models is a bonafide work carried out by Mudit Jha
01FB16ECS214 Saahil Jain 01FB16ECS321 Sayantan Nandy 01FB16ECS345

In partial fulfilment for the completion of eighth semester project work in the Program of Study Bachelor of Technology in Computer Science and Engineering, under the rules and regulations of PES University, Bengaluru, during the period Jan. 2020 – May 2020.

It is certified that all corrections / suggestions indicated for internal assessment have been incorporated in the report. The dissertation has been approved as it satisfies the 8th semester academic requirements in respect of project work.

Signature
<Name of the Guide>, Designation

Signature
Dr. Shylaja S S, Chairperson

Signature
Dr. B K Keshavan, Dean of Faculty

External Viva
Name of the Examiners                              Signature with Date
1. __________________________                     __________________________
2. __________________________                     __________________________

DECLARATION
We hereby declare that the project entitled

Automatic Image Colourization using Generative Models has been carried out by us under the guidance of Prof. Suresh Jamadagni, Associate Professor,

and submitted in partial fulfillment of the course requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering of PES University, Bengaluru, during the academic semester January – May 2020.

The matter embodied in this report has not been submitted to any other university or institution for the award of any degree.

01FB16ECS214 Mudit Jha
01FB16ECS321 Saahil Jain
01FB16ECS345 Sayantan Nandy

ACKNOWLEDGEMENT
We

would like to express our gratitude to Prof. Suresh Jamadagni, Associate Professor, PES University, for his continuous guidance, assistance and encouragement throughout the development of this project. We would also like to thank Dr. Mamta HR and Dr. Jayashree R for all the support and guidance given to us while doing this project. We are grateful to our Project Coordinator, Dr. Anant Koppar, for organising, managing and helping out with the entire process. We take this opportunity to thank Dr. Shylaja S S, Chairperson,

Department of Computer Science and Engineering, PES University, for all the knowledge and support we have received from the department. We would like to thank Dr. B.K. Keshavan, Dean of Faculty, PES University, for his help. We are deeply grateful to Dr. M. R. Doreswamy, Chancellor, PES University, Prof. Jawahar Doreswamy, Pro Chancellor, PES University, and Dr. Suryaprasad J, Vice-Chancellor, PES University, for providing us various opportunities and enlightenment every step of the way. Finally, this project could not have been completed without the continual support and encouragement we have received from our parents, colleagues and friends.

ABSTRACT
Generative models
are becoming more and more common in applied machine learning, as they offer a different view from the classical deep learning approach of having multiple layers of neurons learn an abstraction of the input data through backpropagation. Though that approach has served well and has expanded into many different architectures and problem domains, generative models, namely generative adversarial networks and autoencoders, offer a different approach: one where the model tries to learn the data distribution from which the input dataset is sampled. Because of this, generative models are becoming increasingly common in tasks which involve the generation of new data, such as image-to-image translation. Our problem statement of image colorization falls under this very domain. Thus, a foray into this approach towards colorizing images, which is a step away from existing solutions that generally involve convolutional neural networks, holds a lot of promise and is an area of active research as well.

TABLE OF CONTENTS
1. INTRODUCTION
2. PROBLEM DEFINITION
3. LITERATURE SURVEY
   3.1 Exploring Convolutional Neural Networks for Automatic Image Colorization
       3.1.1 Introduction
       3.1.2 Approach
       3.1.3 Method
   3.2 Image-to-Image Translation with Conditional Adversarial Networks
       3.2.1 Introduction
       3.2.2 Method
   3.3 U-Net: Convolutional Networks for Biomedical Image Segmentation
       3.3.1 Introduction
       3.3.2 Network Architecture
4. PROJECT REQUIREMENTS SPECIFICATION
5. SYSTEM REQUIREMENTS SPECIFICATION
6. SYSTEM DESIGN
7. DETAILED DESIGN
8. IMPLEMENTATION AND PSEUDOCODE
9. TESTING
10. RESULTS AND DISCUSSION
11. SNAPSHOTS
12. CONCLUSIONS
13. FURTHER ENHANCEMENTS
REFERENCES/BIBLIOGRAPHY
APPENDIX A: DEFINITIONS, ACRONYMS AND ABBREVIATIONS
APPENDIX B: USER MANUAL (OPTIONAL)

LIST OF FIGURES
1. Autoencoder - Test Gray Images
2. Autoencoder - Generated Color Images
3. Autoencoder - Original Color Images
4. GAN - Test Gray Images
5. GAN - Original Color Images
6. GAN - Generated Color Images
7. GAN - Code
8. GAN - model build function
9. GAN - model build function
10. GAN - model initializer function
11. GAN - load cifar dataset function
12. AutoEncoder - model initializer function
13. AutoEncoder - encoder model summary
14. AutoEncoder - decoder model summary

1. INTRODUCTION
1.1 Overview
The

automatic colorization of grayscale images has been an active area of research in machine learning for a considerable time. This is due to the wide range of applications, such as color restoration and image colorization for animations. Photography may appear to be just the snap of a picture, but the best photographs often undergo intense post-processing on a computer. Image colorization is one technique to add style to a photograph or apply a combination of styles. Additionally, image colorization can add color to photographs that were originally taken in black and white. This can be used to provide a best guess as to the context of the picture, and help bridge the gap between the past and the present. The goal of our model is to produce realistic colorized photos. In 2014, Goodfellow

proposed a new type of generative model: generative adversarial networks (GANs).


A GAN is composed of two smaller networks called the generator and
discriminator. As the name suggests, the generator’s task is to produce results that are
indistinguishable from real data. The discriminator’s task is to classify whether a sample
came from the generator’s model distribution or the original data distribution. Both of
these subnetworks are trained simultaneously until the generator is able to consistently
produce results that the discriminator cannot classify. The architectures of the generator
and discriminator both follow a multilayer perceptron model. Since colorization is a class
of image translation problems, the generator and discriminator are both convolutional
neural networks (CNNs). The generator is represented by the mapping G(z; θG), where z
is a noise variable (uniformly distributed) that acts as the input of the generator. Similarly,
the discriminator is
represented by the mapping D(x; θD) to produce a scalar between 0 and 1, where x is a
color image. The output of the discriminator can be interpreted as the probability of the
input originating from the training data. These constructions of G and D enable us to
determine the optimization problem for training the generator and discriminator: G is
trained to minimize the probability that the discriminator makes a correct prediction on
generated data, while D is trained to maximize the probability of assigning the correct
label.
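In equation form, this is the minimax value function from the original GAN paper (the non-saturating variant mentioned later in Section 3.2.2, where G maximizes log D(G(z)) instead, is the one more commonly used in practice):

```latex
\min_G \max_D \; V(D,G) \;=\;
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D(x)\right]
  \;+\;
  \mathbb{E}_{z \sim p_z(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```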

Image colorization is an image-to-image translation problem that maps a high-dimensional input to a high-dimensional output. It can be seen as a pixel-wise regression problem where structure in the input is highly aligned with structure in the output. That means the network needs not only to generate an output with the same spatial dimensions as the input, but also to provide color information to each pixel in the grayscale input image. We

start with a simple 256 x 256 pixel grayscale image as an input. We then use a neural network to output a
predicted colorized image. We have trained our model to output photos with realistic colors by training it on
realistic images. This does not mean that the output photo will match the ground truth every time. Instead,
the model should produce a colorized image so realistic that a viewer could not spot the fake when looking
at a true color image and an image produced by our model.


The approach being followed is to try an autoencoder-based model for image colorization before trying out GANs. The reason for this is the often-cited difficulty in training a GAN model, along with the complexity of the model. Training a GAN effectively means iteratively training two fully fledged neural networks, as opposed to one in an autoencoder's case. The goal of this project is to do a comparative study of an autoencoder-based model and a GAN model with an autoencoder as the convolutional network for the GAN generator and discriminator. So, our current approach of using an autoencoder model instead of GANs has the advantage of being more time efficient. It gives us the chance to try out different models more quickly and evaluate the obtained results. This is in contrast to GANs, which have their own class of difficulties owing to the need to train both the generator and discriminator to similar levels.

2. Problem Definition
The aim is to present models for image re-colorization and do a comparative study of the complexity of the models, their training parameters, and the output generated by these models after training on the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.

The stages of this project are:
1. Pre-processing of the CIFAR-10 dataset to grayscale, single-channel 8-bit images (a minimal sketch follows this list).
2. Training an autoencoder network on the CIFAR-10 dataset.
3. Training a GAN model with the autoencoder as the baseline network.
4. Testing of the trained models.
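A minimal sketch of the first pre-processing stage, assuming the Keras bundled CIFAR-10 loader and OpenCV; the function name to_grayscale is illustrative, not from the project code:

```python
# Stage 1 sketch: convert CIFAR-10 colour images to single-channel grayscale inputs.
import numpy as np
import cv2
from tensorflow.keras.datasets import cifar10

(x_train, _), (x_test, _) = cifar10.load_data()        # uint8 arrays, shape (N, 32, 32, 3)

def to_grayscale(images):
    # cv2.cvtColor works on one image at a time, so map it over the batch.
    gray = np.array([cv2.cvtColor(img, cv2.COLOR_RGB2GRAY) for img in images])
    return gray[..., np.newaxis]                        # keep an explicit channel axis: (N, 32, 32, 1)

x_train_gray = to_grayscale(x_train)
x_test_gray = to_grayscale(x_test)
print(x_train_gray.shape, x_train_gray.dtype)           # (50000, 32, 32, 1) uint8
```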
3. Literature Survey

3.1 Exploring Convolutional Neural Networks for Automatic Image Colorization
3.1.1 Introduction
The approach consists of training a CNN with colored images, allowing it to learn color features without any human supervision. More specifically, the authors convert the RGB image to the CIE Lab color space, because the Lab color space provides a Lightness channel that they can use as the grayscale input to their model. It is also known to provide a wider color gamut, which should enable more realistic image colorization. They pass the L (lightness) channel as the input to their CNN, which then outputs the A and B channels (which represent colors for the grayscale image), and then concatenate the input and output to generate the
full 3-channel image. They then convert this back into the RGB color space. The rest of the paper discusses the methods, experiments and observations in more detail.

3.1.2 Approach
Since this problem is an under-constrained multimodal problem with no right or wrong answer (i.e. an object can take multiple colorizations), coming up with a loss function that enhances the visual appeal of the final images is an interesting challenge. The authors therefore experiment with different loss functions for their CNN model, including L2 and smooth L1 (which use a regression-based approach) and the cross-entropy loss (which uses a classification-based approach), to determine the best fit for the task. They also experiment with other CNN techniques such as dropout, different activation functions and batch normalization to observe their effects. Given their experience with loss functions and the inherent challenges with each of them, as well as CNN-based models' tendency to produce sepia-tone colors for objects with ambiguous colors, they also explore other approaches that could help solve this problem. This GAN-based approach works similarly to the CNN approach in that the generative net takes in the L channel as the input, generating the color channels A and B. The discriminator, on the other hand, takes in a generated colorization or a true colorization ("ground truth") and predicts whether the input was a true color image. Here, the competition between the discriminator, which maximizes the accuracy of its predictions, and the generator, which tries to minimize the discriminator's accuracy, results in natural loss functions for backpropagation that do not rely on Euclidean distance measures. The hope is that this leads to brighter and more vibrant colors, since the focus is on generating more realistic colors rather than colors that are close to the training set. To evaluate performance with these architectures and loss functions, they use a combination of qualitative and quantitative metrics, including a colorization Turing test.
3.1.3 Method
Baseline Model
The baseline model involved training a simple CNN model (3 layers of CNN followed by a fully connected layer) on 2000 images from CIFAR-10. They used the L2 loss function as the objective, ReLU as the activation function, and batch normalization layers to accelerate convergence and improve accuracy. Sample outputs from this model (Figure 2 in the paper) show that the baseline model does not colorize the images well, producing very faint/dull colors and muted tonality.

Objective Function
One of the most important challenges in auto-colorization is an objective function that accounts for the multimodal nature of the problem. To investigate this further, they experimented with several loss functions. Based on the results from the baseline model, they expected the L2 loss to give under-saturated images with muted colors due to its averaging effect. This is because the L2 loss, given various colorizations of an object that can take multiple colors (e.g. a car that can be blue, green or red), will choose the mean colorization to reduce the model loss on the input grayscale image. This predicted mean pixel value causes the output images to have muted colors and appear sepia toned. Thus, while the L2 loss might seem well suited for this task, it does not work well for objects that could take multiple colors.

Cross-entropy loss is used for the classification approach. This loss function gives a probability distribution over the classes that each pixel can belong to, helping select the best class for each pixel. They get the most vivid, realistic and saturated images using this method, since they are not trying to minimize the difference between the generated image and the ground truth as in the L2 loss, but are instead working with classes that offer the model more flexibility. However, the number of classes is an important and sensitive hyperparameter here. If the number of classes they generate with the bin() function is too large, then there is a high likelihood of the model making an inaccurate prediction, as it becomes tougher for the model to choose the correct class amongst the increased class set.
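As a simple illustration of this idea (not the paper's actual bin() implementation), colour-channel values can be quantized into a fixed number of bins so that colorization becomes per-pixel classification over those bins:

```python
# Illustrative only: quantize A/B colour values (roughly in [-128, 127] for Lab)
# into a fixed number of bins; each pixel then gets a class label per channel.
import numpy as np

def to_class_labels(ab_channels, num_bins=10):
    edges = np.linspace(-128, 128, num_bins + 1)        # bin edges covering the A/B range
    return np.digitize(ab_channels, edges[1:-1])         # integer labels in [0, num_bins - 1]

ab = np.random.uniform(-128, 127, size=(32, 32, 2))      # fake A/B channels for one image
labels = to_class_labels(ab, num_bins=10)
print(labels.shape, labels.min(), labels.max())           # (32, 32, 2), labels in 0..9
```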

Activation Function
They use the rectified linear unit (ReLU) as the nonlinearity that follows each of the convolutional layers, and found that ReLU helped accelerate training convergence. It is also extremely simple to compute. One drawback of this function is that the model parameters could be updated in such a way that the function's active region ends up in the zero-gradient region, which causes a gradient of 0 to backpropagate through the network, effectively 'killing' neurons and preventing the network from training well. However, they did not run into this challenge in practice, and so used ReLU as the activation function for their model.

Dropout
Dropout is an extremely effective regularization technique. While training, dropout is implemented by only keeping a neuron active with some probability, or setting it to zero otherwise. During training, dropout can be interpreted as sampling a neural network within the full neural network, and only updating the parameters of the sampled network based on the input data. They introduced dropout right after the batch normalization layers in their network, but observed bad results. This is because, given their problem and the dataset size, overfitting is not a challenge. In fact, they want the model to learn as many diverse colorizations as possible; dropout hinders this process by preventing the model from learning 'too much'. Thus, they left dropout out of the final model.
Hyperparameter Tuning
They used the Adam optimization method (default values of β1 = 0.9 and β2 = 0.999) with a learning rate of 1e-3, which was decayed to 1e-4 when the loss started to plateau (which usually happened around epoch 150). They trained for 200 epochs using batch sizes of 250 (varied slightly for different models) on an NVIDIA Tesla K80 GPU. Another important hyperparameter was the number of bins in the classification model, where they used 10 bins. All of these hyperparameter values were found after doing a random search over the hyperparameter space. More specifically, they started off with a small training set of 100 images, and used random search to narrow in on a range for these parameters. For instance, for the learning rate, they ran multiple trials of training to see which learning rate would yield the fastest convergence over a fixed number of iterations. Within the set of learning rates sampled on a logarithmic scale, they found that a learning rate of 1e-3 achieved one of the largest per-iteration decreases in training loss as well as the lowest training loss of the learning rates sampled.

3.2 Image-to-Image Translation with Conditional Adversarial Networks
3.2.1 Introduction
Many problems in image processing, computer graphics, and computer vision can be posed as "translating" an input image into a corresponding output image. Just as a concept may
be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, they define automatic image-to-image translation as the task of translating one possible representation of a scene into another, given sufficient training data. The goal of the paper is to develop a common framework for all these problems.

3.2.2 Method
Generator with skips
A defining feature of image-to-image translation problems is


that they map a high-resolution input grid to a high-resolution output grid. In addition, for the problems they consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. They design the generator architecture around these considerations. In an encoder-decoder network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed. Such a network requires that all information flow pass through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges. To give the generator a means to circumvent the bottleneck for information like this, they add skip connections, following the general shape of a "U-Net". Specifically, they add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.
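A minimal Keras-style sketch of such a skip connection (illustrative, not the paper's exact architecture): an encoder feature map is concatenated with the upsampled decoder feature map at the same resolution.

```python
# Illustrative U-Net-style skip connection: encoder activations at a given resolution
# are concatenated with decoder activations at the same resolution.
from tensorflow.keras import layers, Input, Model

inp = Input(shape=(32, 32, 1))
e1 = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(inp)    # 16x16
e2 = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(e1)    # 8x8 bottleneck

d1 = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(e2)  # back to 16x16
d1 = layers.Concatenate()([d1, e1])       # skip connection: channels of layer i joined with layer n - i
out = layers.Conv2DTranspose(2, 3, strides=2, padding='same', activation='tanh')(d1)  # 32x32x2 colour output

generator = Model(inp, out)
generator.summary()
```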
Markovian discriminator (PatchGAN)
It is well known that the L2 loss (and L1, see Figure 4 in the paper) produces blurry results on image generation problems. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, there is no need for an entirely new framework to enforce correctness at the low frequencies; L1 will already do. This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness. In order to model high frequencies, it is sufficient to restrict attention to the structure in local image patches. Therefore, they design a discriminator architecture, which they term a PatchGAN, that only penalizes structure at the scale of patches. This discriminator tries to classify whether each N×N patch in an image is real or fake. They run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.
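A rough Keras-style sketch of the idea (a hypothetical patch discriminator, not the authors' exact architecture): the last convolution produces a grid of per-patch real/fake scores which are then averaged.

```python
# Illustrative PatchGAN-style discriminator: the final convolution outputs one logit per
# patch location; the averaged response is used as the discriminator's output.
from tensorflow.keras import layers, Input, Model

img = Input(shape=(32, 32, 3))                            # real or generated colour image
x = layers.Conv2D(64, 4, strides=2, padding='same')(img)
x = layers.LeakyReLU(0.2)(x)
x = layers.Conv2D(128, 4, strides=2, padding='same')(x)
x = layers.LeakyReLU(0.2)(x)
patch_logits = layers.Conv2D(1, 4, padding='same')(x)     # one score per patch location
score = layers.GlobalAveragePooling2D()(patch_logits)     # averaged patch responses

discriminator = Model(img, score)
discriminator.summary()
```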

Optimization and inference
To optimize the networks, they follow the standard approach: they alternate between one gradient descent step on D, then one step on G. As suggested in the original GAN paper, rather than training G to minimize log(1 − D(x, G(x, z))), they instead train it to maximize log D(x, G(x, z)). In addition, they divide the objective by 2 while optimizing D, which slows down the rate at which D learns relative to G. They use minibatch SGD and apply the Adam solver, with a learning rate of 0.0002 and momentum parameters β1 = 0.5, β2 = 0.999. At inference time, they run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that they apply dropout at test time, and they apply batch normalization using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed "instance normalization" and has been demonstrated to be effective at image generation tasks. In their experiments, they use batch sizes between 1 and 10 depending on the experiment.
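A minimal sketch of that optimizer configuration in Keras (the hyperparameters are the ones quoted above; the model names in the comments are illustrative):

```python
# Adam configured as described above: learning rate 0.0002, beta_1 = 0.5, beta_2 = 0.999.
from tensorflow.keras.optimizers import Adam

d_optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.999)   # discriminator step
g_optimizer = Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.999)   # generator step

# discriminator.compile(loss='binary_crossentropy', optimizer=d_optimizer)
# combined_model.compile(loss='binary_crossentropy', optimizer=g_optimizer)
```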

3.3 U-Net: Convolutional Networks for Biomedical Image Segmentation
3.3.1 Introduction
Deep convolutional networks have outperformed the state of the art in many visual
recognition tasks. While convolutional networks have already existed for a long
time, their success was limited due to the size of the available training sets and the size of
the considered networks. The breakthrough by Krizhevsky was due to supervised training
of a large network with 8 layers and millions of parameters on the ImageNet dataset with
1 million training images. Since then, even larger and deeper networks have been trained. The typical use of convolutional networks is on classification tasks, where the output to
an image is a single class label. However, in many visual tasks, especially in biomedical
image processing, the desired output should include localization, i.e., a class label is
supposed to be assigned to each pixel. Moreover, thousands of training images are
usually beyond reach in biomedical tasks. Hence, Ciresan et al. trained a network in a
sliding-window setup to predict the class label of each pixel by providing a local region
(patch) around that pixel as input. First, this network can localize. Secondly, the training
data in terms of patches is much larger than the number of training images.

The U-Net builds upon a more elegant architecture, the so-called "fully convolutional network". The authors modify and extend this architecture such that it works with very few training images and yields more precise segmentations. The main idea is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high-resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.

3.3.2 Network Architecture

The U-Net consists of a contracting path and an expansive path. The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step the number of feature channels is doubled. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution ("up-convolution") that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers. To allow a seamless tiling of the output segmentation map, it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size.

4. Project Requirements Specification
Our chosen topic revolves around generative models, which are a comparatively newer domain of machine learning as compared to classifiers. With most of our previous experience in machine learning being with classifiers (both discriminative and generative ones), a particular requirement was to build some understanding of how generative models work, specifically of terms in this space such as latent space, learning of a data distribution, etc. With respect to the implementation, we have an autoencoder model as well as a GAN model.

5. System Requirements Specification
Functional Requirements:
Perform image de-colorization: This is a necessary prerequisite step for our application. There are not many readily available datasets consisting of black-and-white, i.e. grayscale, images which we need to provide as the input to our model. As a result, it is necessary to work with the available datasets, which consist of color images, and process them to remove this color. The technical aspect to take care of in this regard is the number of channels in the image. The transformation takes the image from having three channels (RGB) to just one. As a result, the shape of the numpy ndarray which holds the values representing pixel intensities across the various channels has to be changed accordingly.

Train the model: The two different models both require their own specific architecture. The encoder and decoder work together to first learn the abstract information from an image before performing transpose convolution to add new information in order to come up with a new image.

Save the model (weights and biases learnt): The model's weights need to be saved as the training process goes on. This is useful as it helps in saving the state of progress of the model - in case the training of the model is interrupted due to a break in connection, it can then be continued from the last saved state. Also, it helps isolate the model from its training environment, as the trained model can then be used to test new batches of images without having to undergo the training process again.

Test the model: The model needs to be tested with some particular loss function so that the usefulness of the model can be quantified.

Non-Functional Requirements:
The model should not take too long to train: Considering the complexity involved with a lot of generative models, where some of the state-of-the-art ones are trained on specialized hardware and still take many hours (and sometimes days) to train, it is important to have a model which is not so intensive. In keeping with our limited experience and the well-documented issues which occur in training GAN models and their variants, it becomes necessary for the model to train fast enough for us to analyze the progress made in different metrics across epochs, so that we can decide whether the training is progressing sufficiently well to let it continue.

The model should generalise well across images of different kinds: The success of a deep learning model depends on how well it is able to generalise for different kinds of input, and the same holds true for our model. Though our model is trained on a dataset wherein the images are quite similar in terms of various photometric attributes (the gradient of the image and the general color combinations), it should ideally be able to colorize images from different domains and provide realistic output in different cases.

Hardware Requirements: As with most deep learning models, the availability of a GPU instance for training is highly beneficial in terms of saving time. We made use of the free GPU instance available on Google Colab for training our models. The CPU and GPU specifications for the cloud instance availed are:

GPU: 1x Tesla K80, compute capability 3.7, 2496 CUDA cores, 12 GB GDDR5 VRAM
CPU: 1x single-core hyper-threaded Xeon processor @ 2.3 GHz (1 core, 2 threads)
RAM: 12.6 GB available
Disk: 33 GB available

Software Requirements:
OpenCV (for basic image processing tasks, such as conversion of colour images into greyscale): the Python package cv2 is the Python binding for the OpenCV library (which is natively written in C++).
Pandas, numpy: these are the core packages for loading and wrangling data. Numpy's ndarray object is used extensively for handling data. It has an advantage over native Python data types as its implementation is in C, and thus provides faster data access as well as being more memory efficient.
Keras: a high-level API for building and training deep learning models.
Tensorflow: a lower-level library for machine learning.


6. System Design

An autoencoder is a type of artificial neural network used to learn efficient data


codings in an unsupervised manner. The aim of an autoencoder is to learn a
representation (encoding) for a set of data, typically for dimensionality reduction, by
training the network to ignore signal “noise”. Along with the reduction side, a
reconstructing side is learnt, where the autoencoder tries to generate from the reduced
encoding a representation as close as possible to its original input, hence its name.
Several variants exist to the basic model, with the aim of forcing the learned
representations of the input to assume useful properties. Examples are the regularized
autoencoders (Sparse, Denoising and Contractive autoencoders), proven effective in
learning representations for subsequent classification tasks, and Variational
autoencoders, with their recent applications as generative models. Autoencoders are
effectively used for solving many applied problems, from face recognition to acquiring the
semantic meaning of words.

The

simplest form of an autoencoder is a feedforward, non-recurrent neural network


similar to single layer perceptrons that participate in multilayer perceptrons (MLP) –
having an input layer, an output layer and one or more hidden layers connecting them –
where the output layer has the same number of nodes (neurons) as the input layer, and
with the purpose of reconstructing its inputs (minimizing the difference between the input
and the
output) instead of predicting the target value Y given inputs X. Therefore, autoencoders
are unsupervised learning models
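A minimal Keras-style sketch of a convolutional autoencoder of the kind used here (layer widths and the latent size are illustrative assumptions, not the project's exact architecture): the encoder compresses the grayscale input to a latent vector and the decoder reconstructs a colour output.

```python
# Minimal convolutional autoencoder sketch for colorization on 32x32 inputs.
from tensorflow.keras import layers, Input, Model

latent_dim = 256                                              # illustrative latent vector size

# Encoder: grayscale image -> latent vector
enc_in = Input(shape=(32, 32, 1))
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(enc_in)   # 16x16
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)        # 8x8
x = layers.Flatten()(x)
latent = layers.Dense(latent_dim, name='latent_vector')(x)
encoder = Model(enc_in, latent, name='encoder')

# Decoder: latent vector -> 3-channel colour image
dec_in = Input(shape=(latent_dim,))
x = layers.Dense(8 * 8 * 64, activation='relu')(dec_in)
x = layers.Reshape((8, 8, 64))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)   # 16x16
x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)   # 32x32
dec_out = layers.Conv2DTranspose(3, 3, padding='same', activation='sigmoid')(x)      # colour output in [0, 1]
decoder = Model(dec_in, dec_out, name='decoder')

autoencoder = Model(enc_in, decoder(encoder(enc_in)), name='autoencoder')
autoencoder.compile(optimizer='adam', loss='mse')
```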


A generative adversarial network (GAN) is a class of machine learning frameworks invented by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the sense of game theory, often but not always in the form of a zero-sum game). Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proven useful for semi-supervised learning, fully supervised learning, and reinforcement learning.

The generative network generates candidates while the discriminative network evaluates them. The contest operates in terms of data distributions. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The generative network's training objective is to increase the error rate of the discriminative network.

A known dataset serves as the initial training data for the discriminator. Training it involves presenting it with samples from the training dataset, until it achieves acceptable accuracy. The generator trains based on whether it succeeds in fooling the discriminator. Typically the generator is seeded with randomized input that is sampled from a predefined latent space (e.g. a multivariate normal distribution). Thereafter, candidates synthesized by the generator are evaluated by the discriminator. Backpropagation is applied in both networks so that the generator produces better images, while the discriminator becomes more skilled at flagging synthetic images. The generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network. GANs often suffer from "mode collapse", where they fail to generalize properly, missing entire modes from the input data. For example, a GAN trained on the MNIST dataset, containing many samples of each digit, might nevertheless omit a subset of the digits from its output. Some researchers perceive the root problem to be a weak discriminative network that fails to notice the pattern of omission, while others assign blame to a bad choice of objective function.

The general GAN structure: our GAN model uses an autoencoder for the generator model.
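A minimal sketch of one training step for such a colorization GAN (it assumes pre-built Keras models named generator, discriminator and combined; this illustrates the scheme described above, not the project's actual code):

```python
# One illustrative GAN training step for colorization: the generator maps grayscale
# images to colour images; the discriminator classifies colour images as real or fake.
import numpy as np

def train_step(generator, discriminator, combined, gray_batch, color_batch):
    batch_size = gray_batch.shape[0]
    real_labels = np.ones((batch_size, 1))
    fake_labels = np.zeros((batch_size, 1))

    # 1) Train the discriminator on real colour images and on generated colourizations.
    fake_colors = generator.predict(gray_batch, verbose=0)
    d_loss_real = discriminator.train_on_batch(color_batch, real_labels)
    d_loss_fake = discriminator.train_on_batch(fake_colors, fake_labels)

    # 2) Train the generator through the combined model (discriminator frozen),
    #    pushing it to make the discriminator label its outputs as real.
    g_loss = combined.train_on_batch(gray_batch, real_labels)
    return d_loss_real, d_loss_fake, g_loss
```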
7. Detailed Design
For the autoencoder model: As our implementation consists primarily of a deep learning model, training it, validating it, and finally testing it on various mutually exclusive subsets of our image dataset, we have not divided our work into separate modules. Instead, the implementation can be viewed as a collection of multiple Jupyter Notebook cells, each of which introduces or progresses the work of building a functioning model. In terms of functionality, the autoencoder-based model can be considered to be made up of the following parts:
1. The setup cell, where all the necessary packages, sub-packages and methods are loaded into the environment namespace.
2. The function to de-colorize an RGB image - this is later applied to both the training and testing sets before passing them through the model.
3. The feature to stack up a variable number of images from the dataset and view them side by side, to get an idea of the variance in the dataset.
4. Defining the network parameters - this includes defining the input layer, which should be in accordance with the shape and size of the input images, the batch size, which is a training-specific parameter, and other parameters relevant to the CNN-based structure of the encoder model, including the kernel size and the stride. Another important parameter is the size of the latent vector; it is this value that specifies the amount of detail in what our model has learnt.
5. Generating the structure and summary of both the encoder and decoder models.
6. More training-phase-specific parameters: we make use of the ReduceLROnPlateau method, which dynamically lowers the learning rate if there isn't sufficient improvement after a set number of epochs. Checkpoints are also defined for saving the model periodically as well as at the end of training (a short sketch of these callbacks follows this list).
7. This is followed by instantiating the actual training process, and then 'fitting' the trained model on the testing set of our data. The resulting output is then saved locally or can be compared to the original input which yielded it.
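A minimal sketch of the training-phase callbacks mentioned in part 6 (file names and thresholds are illustrative, not the project's exact settings):

```python
# ReduceLROnPlateau lowers the learning rate when the monitored loss stops improving;
# ModelCheckpoint periodically saves weights so training can resume after an interruption.
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

lr_reducer = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5,
                               min_lr=1e-6, verbose=1)
checkpoint = ModelCheckpoint('autoencoder_colorizer.h5', monitor='val_loss',
                             save_best_only=True, verbose=1)

# autoencoder.fit(x_train_gray, x_train_color, batch_size=32, epochs=30,
#                 validation_split=0.2, callbacks=[lr_reducer, checkpoint])
```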
For the GAN model, the design is more detailed as it involves the simultaneous training of two full-fledged deep learning models. As a result, the implementation is more modularized. For this codebase, we have the following modules:
● dataset.py
● main.py
● models.py
● networks.py
● ops.py
● options.py
● utils.py
The dataset.py module consists of a class-based wrapper over the core dataset (CIFAR-10)
that we have used to train and test our core model. Instead of a simple import and splitting of the dataset,
we have defined custom TestDataset, BaseDataset and Cifar10Dataset classes with relevant helper
methods which serve to help provide useful utility wrappers over the datasets such as simple and
convenient unpickling, shuffling and stacking of images. The main.py module is the entry point to the workflow of the project: this is where the computational graph which powers a defined TensorFlow model is built, where the TensorFlow session is initialized, and where the conditions for training and building the model, as well as loading the saved model afterwards for future use, are defined. The models.py module is again an object-oriented definition of the GAN models which are trained and used. It consists of
a BaseModel class as well as a child class which inherits the basic
attributes from this model. This setup makes the code extensible by allowing us to incorporate other different
GAN models in the same project by defining it here, making it inherit from the BaseModel parent class. The
networks.py module defines the individual neural networks which make up our GAN model i.e. the
discriminator and the generator. The reasoning behind organizing it this way is that it allows us to combine
differently defined discriminators and generators as desired in the future. This is reflected in the BaseModel
class which takes class instances for both discriminator and generator neural networks as arguments. The
ops.py module defines both pre-processing as well as post-processing steps such as decolorizing input to
feed into the network, as well as storing output images. In the options.py module, we make use of the
standard library package argparse for defining command line arguments which can be passed when running
our model with customized input for different behaviours. Some of these are changing the number of epochs
to be used for training, the batch size for the stochastic gradient descent process, changing the learning rate
decay feature (something we made use of in our autoencoder model as well), as well as the training status logging frequency, among other such options. In utils.py, we have simple helper methods which provide features such as saving the pickled model, showing a progress bar, plotting graphs given an array input, etc.

8. Implementation and Pseudocode
Implementation for the autoencoder:
The autoencoder model exists as a single Jupyter Notebook which is modularized into different functions and cells (each Jupyter Notebook cell can be executed independently while sharing a namespace with the cells defined above it), each with a specific purpose. This involves loading the required modules from different packages as well as the dataset, defining pre-processing functions for converting the default three-channel RGB images into single-channel grayscale ones, and viewing the original (ground truth), decolorized and colorized output images as a stack of multiple images (whose dimensions can be configured).
9. Testing
The testing for the autoencoder model effectively looks at how well the model is able to construct an output image, by placing it in direct contrast with the input image. The model uses the mean squared error loss function, which in this case effectively tells us the average pixel-wise difference between the input and output images. The dataset is broken up into a training and testing set in the ratio of 5:1, and training is done in batches of size 32, with a new batch of images being trained on every epoch.
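A minimal sketch of this evaluation step, assuming preprocessed arrays x_gray (inputs) and x_color (targets) scaled to [0, 1] and the autoencoder from the earlier sketch; the 5:1 split and batch size of 32 are the values described above:

```python
# Train/test split in the 5:1 ratio described above, then evaluate the mean squared error.
split = int(len(x_gray) * 5 / 6)
x_train_g, x_test_g = x_gray[:split], x_gray[split:]
x_train_c, x_test_c = x_color[:split], x_color[split:]

autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(x_train_g, x_train_c, batch_size=32, epochs=30,
                validation_data=(x_test_g, x_test_c))

test_mse = autoencoder.evaluate(x_test_g, x_test_c, batch_size=32)
print('Average pixel-wise MSE on the test set:', test_mse)
```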
10. Results and Discussion
Results for the Autoencoder Model:
[Figures 1-3: autoencoder test gray images, generated color images and original color images - not reproduced in this text version]
As can be seen, the autoencoder model outputs realistic colors for the different varieties of images, though
they do not match up exactly with the corresponding ground truth image. The autoencoder fails to capture
the subtlety in the original (ground truth) images. This is also seen in other images in the dataset - the
coloring obtained from the autoencoder is consistent but doesn't always match the original although it does
capture a realistic color, which is predictably due to the fact that there are multiple images from the same
‘class’.

Results for the GAN model:
[Figures 4-6: GAN test gray images, original color images (ground truth) and GAN-generated color images (predicted output) - not reproduced in this text version]

The output provided by the GAN model, in contrast to the
autoencoder model, does not provide consistent coloring as the images in the output look to be colored in
patches - especially with out-of-place tinges in multiple places. This phenomenon is noticed in other images too, where certain parts of the image look to be thoroughly colored but other parts look discolored, with only patches of some color present. In contrast to the autoencoder model, the output images of the GAN model are much patchier - they seem to look like black-and-white images with patches of color filled in, instead of well-rounded colored images.
11. Snapshots
[Figures 7-14: GAN code, GAN model build and initializer functions, the CIFAR-10 loading function, and the autoencoder model initializer and encoder/decoder model summaries - not reproduced in this text version]
12. Conclusions
After working on implementing the two kinds of generative models, we got a firsthand idea of, and experience with, the difficulties in training a GAN model (where the efficacy depends on how well two neural networks train concurrently). We observed that we obtained better results with the autoencoder model than we did with the GAN model. Despite the fact that GAN models are generally used for, and are able to achieve, more complex tasks than autoencoders, the difficulties with training them in a stable manner remain, as does the need for extensive computational resources. Thus, for our particular use case, an autoencoder-based model worked out well. The autoencoder model was trained for 30 epochs, resulting in a loss of 0.1196. This was reflected in the generated images, which were generally accurate and continuous in nature.

13. Further Enhancements
Our work revolved around exploring generative models, in contrast to the established CNN-based solutions in the problem domain. The field of GANs, however, is rapidly advancing, with new ideas and architectures being proposed all the time. An example of this is the fact that GANs are now applicable to text-based generation problems, something which the creator of vanilla GANs, Ian Goodfellow, himself didn't envision initially. Similarly, in the domain of image-to-image translation tasks such as this one, some novel GAN-based approaches are being proposed. One of these is called the Pix2Pix GAN model. This has
similarities to the GAN model we used for our project, as it uses an autoencoder for the generator of the GAN. However, its definition of loss is different from our discriminator's - whereas our discriminator classifies an entire image as real or fake, the Pix2Pix model's discriminator takes a novel approach in that it classifies sections/chunks of the generator's output as real or fake. Another aspect of improvement in our work is in the evaluation metrics for the GAN model. Unlike other deep learning neural network models that are trained with a loss function until convergence, a GAN generator model is trained using a second model, called a discriminator, that learns to classify images as real or generated. Both the generator and discriminator models are trained together to maintain an equilibrium. As such, there is no objective loss function used to train the GAN generator models and no way to objectively assess the progress of the training and the relative or absolute quality of the model from loss alone. There is some scope for enhancement here, however: quantitative approaches for evaluation involve making use of low-level image statistics as well as measures like the Inception score, the Wasserstein critic, etc.

References/Bibliography
Source for Google Colab instance specs:

https://colab.research.google.com/drive/151805XTDg--dgHb3-AXJCpnWaqRhop_2

Google Developers' introductory post on GAN models: https://developers.google.com/machine-learning/gan/
Tips for training stable GAN models: https://machinelearningmastery.com/how-to-train-stable-generative-adversarial-networks/
Beginner's guide to GANs: https://pathmind.com/wiki/generative-adversarial-network-gan
Visual progression of training of a GAN model from a randomly picked data distribution (helped in developing an intuition for the training process): https://poloclub.github.io/ganlab/
Proposed alternative to vanilla GANs (also discusses shortcomings of vanilla GAN models): https://towardsdatascience.com/pix2pix-869c17900998
Stanford CS231 course lecture on generative models, covering everything from probability distributions to autoencoders and their variants to GAN models: https://www.youtube.com/watch?v=5WoItGTWV54&feature=youtu.be
Detailed explanation of the backpropagation process in training a GAN: https://www.youtube.com/watch?v=RRTuumxm3CE
Comprehensive introduction to autoencoders: https://towardsdatascience.com/generating-images-with-autoencoders-77fd3a8dd368
A well curated Kaggle description of autoencoders: https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
Identifying and diagnosing GAN failure modes: https://machinelearningmastery.com/practical-guide-to-gan-failure-modes/
Improving GAN performance: https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b
Mode collapse in GANs, what it is and its implications: https://cedar.buffalo.edu/~srihari/CSE676/22.3-GAN Mode Collapse.pdf
Papers referred to:
1. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2016.
2. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015.
3. http://irvlab.cs.umn.edu/projects/adversarial-image-colorization - the given resource provides an overview of some different GAN architectures which have been or can be applied to this task.
4. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
5. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs.
6. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation.
