A PROJECT REPORT
Submitted in partial fulfilment of the requirements for the award of the
degree
of
BACHELOR OF TECHNOLOGY (HONS.)
In
(Department of Electronics & Communication Engineering)
Submitted by
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
ENGINEERING
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, RANCHI –
834010 (JHARKHAND), INDIA
IIIT RANCHI
Date: 10/05/2021
CERTIFICATE
This is to certify that the project titled “Image Caption Generator” is a record of
the bona fide work done by Shubham Chandra Poddar (2017UGEC048R) and
Sarthak Srivastava (2017UGEC031R), submitted in partial fulfilment of the
requirements for the award of the Degree of Bachelor of Technology (Hons.) in the
Department of Electronics and Communication Engineering of the Indian Institute
of Information Technology Ranchi, during the academic year 2020-21.
ACKNOWLEDGMENT
Regards,
Shubham Chandra Poddar (2017UGEC048R)
Sarthak Srivastava (2017UGEC031R)
ABSTRACT
LIST OF FIGURES
Figure No. Figure Title Page No.
1.1 Demo Image 7
2.1 Generating Sentence for Image 9
3.1 Describing Image 10
3.2 Automatic Image Captioning using Recurrent Neural Network 11
4.1 Architecture of VGG16 model 12
4.2 LSTM Architecture 13
4.3 Schematic of Merge Model For Image Captioning 15
4.4 Plot of the Caption Generation Deep Learning Model 16
4.5 UI/UX of our app 17
4.6 Code Structure of our app 17
4.7 Uploaded Image Prediction 17
4.8 Output of our model through API 18
5.1 Input Image 19
5.2 Output of the above Input Image 19
LIST OF TABLES
Contents
Page No
Acknowledgement 3
Abstract 4
List of Figures 5
List of Tables 5
Chapter 1 INTRODUCTION
1.1 Motivation 7
1.2 Objectives of the project 7
Chapter 3 METHODOLOGY
3.1 Detailed Methodology 10
Chapter 4 IMPLEMENTATION
4.1 Prepare Photo Data 12
4.2 Prepare Text Data 13
4.3 Developing Deep Learning Model 13
4.4 Deployment 17
CHAPTER 1
INTRODUCTION
1.1. Motivation
Aid to the blind - We can create a product for the blind that guides them while travelling
on the roads without anyone else's support. We can do this by first converting the scene
into text and then converting the text to voice. Both are now well-known applications of Deep Learning.
Self-driving cars - Automatic driving is one of the biggest challenges, and if we can
properly caption the scene around the car, it can give a boost to the self-driving system.
1.2. Objectives of the project
● How to prepare photo and text data for training a deep learning model.
● How to design and train a deep learning caption generation model.
● How to evaluate a trained caption generation model and use it to caption entirely new
photographs.
CHAPTER 2
LITERATURE SURVEY
1. State-of-the-art CNN models for image classification are judged by their performance on
the ImageNet challenge. They have progressed incrementally since Krizhevsky et al., 2012
introduced AlexNet, with modifications to the architecture that have improved
performance. In 2014, Szegedy et al., 2015 introduced GoogLeNet, which improved
on AlexNet mainly by greatly reducing the number of parameters
involved. Also in 2014, Simonyan & Zisserman, 2014 introduced VGGNet, which
achieved good performance because of the depth of the network. Most recently, in 2015,
He et al., 2015 introduced ResNet, which utilizes “skip connections” and batch
normalization. The performance of these models on ImageNet is shown below.
2. The image captioning problem has also seen a lot of work in recent years.
Karpathy & Fei-Fei, 2015 introduced a method that combines a pre-trained CNN
(VGGNet) as a feature extractor, a Markov Random Field (MRF) model for alignment,
and an RNN for generating text descriptions. The approach of Donahue et al., 2015
similarly uses a pre-trained VGGNet as a feature extractor, and directly inputs these
feature vectors and the word embedding vectors for sentences at each time step of a Long
Short-Term Memory (LSTM) model, which was introduced by Hochreiter &
Schmidhuber, 1997. Vinyals et al., 2015 improved on this approach by
inputting the image vector only at the first time step of the LSTM, which they found to
improve the results.
4. A large amount of work has been done on image caption generation tasks. The first
significant work on image captioning was done by Ali Farhadi, where three
spaces are defined, namely the image space, the meaning space and the sentence space, and
images and sentences are each mapped from their own space into the meaning space.
CHAPTER 3
METHODOLOGY
The task of image captioning can be divided logically into two modules: a CNN and an RNN.
Captioning is about merging the two to combine their strengths.
A CNN (Convolutional Neural Network) preserves spatial information and recognizes objects in the
image. An RNN (Recurrent Neural Network) works well with any kind of sequential data, such as
generating a sequence of words. By merging the two, we get a model that can find patterns
and features in images and then use that information to generate descriptions of those images.
There are many open-source datasets available for this problem, such as Flickr8k (containing 8k
images), Flickr30k (containing 30k images) and MSCOCO (containing 180k images). We have
used the Flickr8k dataset in this project. The dataset consists of two parts: Flicker8K_Dataset
(contains 8092 photographs in JPEG format) and Flicker8K_text (contains a number of files with
different sources of descriptions for the photographs). The dataset has a pre-defined training
set (6000 images), validation set (1000 images) and test set (1000 images).
The image below summarizes the approach given above:
Usually, a pretrained CNN extracts the features from our input image. The feature vector is
linearly transformed to have the same dimension as the input dimension of the RNN/LSTM
network. This network is trained as a language model on our feature vector.
For training our LSTM model, we predefine our label and target text. For example, if the caption
is “A man and a girl sit on the ground and eat.”, our label and target would be as follows:
Label – [ <start>, A, man, and, a, girl, sit, on, the, ground, and, eat, . ]
Target – [ A, man, and, a, girl, sit, on, the, ground, and, eat, ., <end> ]
This is done so that our model understands the start and end of our labelled sequence.
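The label/target construction above amounts to shifting one token list by a single position. The following is a minimal sketch of that idea, not the project's own code; the helper name make_label_and_target is hypothetical, and it assumes the final period is treated as a separate token, as in the example.

```python
# Sketch: build the label (decoder input) and target for one caption.
START, END = "<start>", "<end>"

def make_label_and_target(caption):
    """Return (label, target); target is the label shifted left by one token."""
    # Split off the period so that it becomes its own token, as in the example.
    words = caption.replace(".", " .").split()
    tokens = [START] + words + [END]
    # Label drops the final <end>; target drops the leading <start>.
    return tokens[:-1], tokens[1:]

label, target = make_label_and_target("A man and a girl sit on the ground and eat.")
print(label)
print(target)
```

At every position i, target[i] is the word the model should predict after having seen label[0..i].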
Keywords:
Embedding Layer: An embedding is used as the input layer of the model. As a
machine learning model only works with numbers, each word is first converted to
an integer index, and the embedding layer then maps that index to a dense vector of fixed size, which is fed to the model.
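As an illustration of the word-to-index-to-vector idea, the sketch below implements an embedding as a plain lookup table. It is only a toy under stated assumptions: the helper names are hypothetical, the vectors are random rather than learned, and index 0 is reserved for padding by assumption.

```python
import random

def build_vocab(words):
    """Give each distinct word an integer index; 0 is kept free for padding."""
    return {w: i + 1 for i, w in enumerate(sorted(set(words)))}

def make_embedding_table(vocab_size, dim, seed=0):
    """One dense vector of length `dim` per index (random here, learned in a real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]

def embed(words, vocab, table):
    """Look up the vector for each word via its integer index."""
    return [table[vocab[w]] for w in words]

words = "a man and a girl".split()
vocab = build_vocab(words)
table = make_embedding_table(len(vocab), dim=4)
vectors = embed(words, vocab, table)
print(vocab)
print(len(vectors), "vectors of size", len(vectors[0]))
```

The same word always maps to the same vector, which is why both occurrences of "a" share one row of the table.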
Long Short-Term Memory (LSTM): An LSTM is a kind of recurrent neural network. In an RNN, the
output from the previous step is fed as input to the current step. The LSTM tackles the
long-term dependency problem of plain RNNs, which give accurate predictions from
recent information but struggle to use information seen many steps earlier. An LSTM can hold
information for a long period of time. It is used for processing, predicting and classifying
sequence data.
CHAPTER 4
IMPLEMENTATION
4.1 Prepare Photo Data
We can load the VGG model in Keras using the VGG16 class. We remove the last layer
from the loaded model, as this is the layer used to predict a classification for a photo. We
are not interested in classifying images but in the internal representation
of the photo right before a classification is made. These are the “features” that the model
has extracted from the photo. Keras also provides tools for reshaping the loaded photo into
the preferred size for the model (e.g. a 3-channel 224 x 224 pixel image).
4.2 Prepare Text Data
The dataset contains multiple descriptions for each photograph and the text of the
descriptions requires some minimal cleaning. Each photo has a unique identifier. This
identifier is used on the photo filename and in the text file of descriptions. We need to map
each photo identifier to a list of one or more textual descriptions.
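A minimal sketch of this mapping is given here. The layout of the descriptions file is an assumption (one caption per line, starting with the photo file name and a caption number, followed by the caption text); the function name load_descriptions and the sample identifiers are hypothetical.

```python
def load_descriptions(doc):
    """Map each photo identifier to the list of its descriptions."""
    mapping = {}
    for line in doc.strip().split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        # Assumed layout: "<file name>#<caption number>" then the caption words.
        image_id = tokens[0].split(".")[0]
        mapping.setdefault(image_id, []).append(" ".join(tokens[1:]))
    return mapping

sample = (
    "photo_001.jpg#0\tA child climbs the stairs .\n"
    "photo_001.jpg#1\tA girl goes into a wooden building .\n"
    "photo_002.jpg#0\tA black dog runs on the grass .\n"
)
descriptions = load_descriptions(sample)
print(descriptions)
```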
We will clean the text in the following ways in order to reduce the size of the vocabulary
of words we will need to work with:
Once cleaned, we can summarize the size of the vocabulary. Ideally, we want a vocabulary
that is both expressive and as small as possible. A smaller vocabulary will result in a
smaller model that will train faster.
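The exact cleaning steps are not listed in this report, so the sketch below assumes a common set: lowercasing, removing punctuation, and dropping one-character tokens and tokens that contain digits. The function names are hypothetical. It also shows how the vocabulary size can be summarized after cleaning.

```python
import string

def clean_description(text):
    """Lowercase, strip punctuation, drop 1-letter and non-alphabetic tokens (assumed steps)."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha() and len(w) > 1)

def to_vocabulary(descriptions):
    """Collect the set of distinct words over all cleaned descriptions."""
    vocab = set()
    for desc_list in descriptions.values():
        for desc in desc_list:
            vocab.update(desc.split())
    return vocab

raw = {"photo_001": ["A little girl, aged 5, runs!", "The girl runs in a field."]}
cleaned = {k: [clean_description(d) for d in v] for k, v in raw.items()}
vocabulary = to_vocabulary(cleaned)
print(cleaned)
print("Vocabulary size:", len(vocabulary))
```

Dropping one-letter tokens also removes the article "a"; whether that is desirable is a design choice that trades some fluency for a smaller vocabulary.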
4.3 Developing Deep Learning Model
This section is divided into the following parts:
1. Loading Data.
2. Defining the Model.
3. Fitting the Model.
4.3.1 Loading data
First, we must load the prepared photo and text data so that we can use it to fit the model.
We train the model on all of the photos and captions in the training dataset.
While training, we monitor the performance of the model on the
development/validation set. The train and development datasets are predefined in the
Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, which both contain
lists of photo file names. From these file names, we can extract the photo identifiers and
use these identifiers to filter photos and descriptions for each set.
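The identifier extraction and filtering step can be sketched as follows. This is an illustrative sketch only: it assumes each line of the split file holds one photo file name, and the function names and sample file names are hypothetical.

```python
import os

def load_set(doc):
    """Turn a list of photo file names (one per line) into a set of identifiers."""
    identifiers = set()
    for line in doc.split("\n"):
        line = line.strip()
        if line:
            # The identifier is the file name without its extension.
            identifiers.add(os.path.splitext(line)[0])
    return identifiers

def filter_by_ids(mapping, identifiers):
    """Keep only the entries (descriptions or photo features) of the given set."""
    return {k: v for k, v in mapping.items() if k in identifiers}

train_ids = load_set("photo_001.jpg\nphoto_003.jpg\n")
all_descriptions = {
    "photo_001": ["child climbs the stairs"],
    "photo_002": ["black dog runs on the grass"],
    "photo_003": ["man rides bike"],
}
train_descriptions = filter_by_ids(all_descriptions, train_ids)
print(sorted(train_descriptions))
```

The same filter can be applied to the dictionary of photo features, so that photos and descriptions of one set stay aligned by identifier.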
The model we develop generates a caption for a given photo one word at a time. The sequence of
previously generated words is provided as input. Therefore, we need a first word to kick off the
generation process and a last word to signal the end of the caption; we use ‘startseq’ and
‘endseq’ for this purpose. It is important to add these tokens before we encode
the text so that they are also encoded correctly. Next, we load the photo features
for a given dataset, and after that we encode the text.
Each description will be split into words. The model will be provided one word and the
photo and generate the next word. Then the first two words of the description will be
provided to the model as input with the image to generate the next word. This is how the
model will be trained.
For example, the input sequence “little girl running in field” would be split into 6
input-output pairs to train the model:
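The six pairs for this example can be enumerated with a short sketch. The helper name word_pairs is hypothetical; the ‘startseq’ and ‘endseq’ tokens are the ones introduced above, and the photo is omitted because it is the same in every pair.

```python
def word_pairs(description):
    """Return (words seen so far, next word) pairs for one description."""
    tokens = ["startseq"] + description.split() + ["endseq"]
    # Each prefix of the sequence predicts the token that follows it.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = word_pairs("little girl running in field")
for context, next_word in pairs:
    print(" ".join(context), "->", next_word)
```

Seven tokens (five words plus the two markers) yield six pairs, the first predicting "little" from "startseq" alone and the last predicting "endseq" from the full caption.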
Later, when the model is used to generate descriptions, the generated words will be
concatenated and recursively provided as input to generate a caption for an image.
The function below named create_sequences(), given the tokenizer, a maximum sequence
length, and the dictionary of all descriptions and photos, will transform the data into
input-output pairs of data for training the model. There are two input arrays to the model:
one for photo features and one for the encoded text. There is one output for the model
which is the encoded next word in the text sequence.
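The original listing of create_sequences() is not reproduced in this text. The following is a dependency-free sketch of what such a function could look like, under several assumptions: a plain word-to-index dictionary stands in for the tokenizer, index 0 is reserved for padding, sequences are padded on the left, and the output word is one-hot encoded. The helpers pad_left and one_hot are hypothetical.

```python
def pad_left(seq, max_length):
    """Left-pad with zeros (or keep the last max_length items) to a fixed length."""
    seq = seq[-max_length:]
    return [0] * (max_length - len(seq)) + seq

def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the given index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def create_sequences(word_index, max_length, descriptions, photos):
    """Build the two input arrays (photo features, text) and the output array."""
    vocab_size = len(word_index) + 1  # +1 because index 0 is the padding value
    X1, X2, y = [], [], []
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            seq = [word_index[w] for w in desc.split() if w in word_index]
            for i in range(1, len(seq)):
                X1.append(photos[key])                    # photo features
                X2.append(pad_left(seq[:i], max_length))  # words so far
                y.append(one_hot(seq[i], vocab_size))     # next word
    return X1, X2, y

word_index = {"startseq": 1, "dog": 2, "runs": 3, "endseq": 4}
descriptions = {"photo_001": ["startseq dog runs endseq"]}
photos = {"photo_001": [0.1, 0.2]}  # stand-in for a 4,096-element feature vector
X1, X2, y = create_sequences(word_index, 5, descriptions, photos)
for row in zip(X2, y):
    print(row)
```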
We now have enough to load the data for the training and development datasets and
transform the loaded data into input-output pairs for fitting a deep learning model.
4.3.2 Defining the model
● Photo Feature Extractor. This is a 16-layer VGG model pre-trained on the ImageNet
dataset. We have pre-processed the photos with the VGG model (without the output
layer) and will use the extracted features predicted by this model as input. The Photo
Feature Extractor model expects input photo features to be a vector of 4,096 elements.
These are processed by a Dense layer to produce a 256 element representation of the
photo.
● Sequence Processor. This is a word embedding layer for handling the text input,
followed by a Long Short-Term Memory (LSTM) recurrent neural network layer. The
Sequence Processor model expects input sequences with a predefined length (34
words) which are fed into an Embedding layer that uses a mask to ignore padded
values. This is followed by an LSTM layer with 256 memory units.
● Decoder. Both the feature extractor and sequence processor output a fixed-length
vector. These are merged together and processed by a Dense layer to make a final
prediction. The Decoder model merges the vectors from both input models using an
addition operation. This is then fed to a Dense 256 neuron layer and then to a final
output Dense layer that makes a softmax prediction over the entire output vocabulary
for the next word in the sequence.
Both input models (Photo Feature Extractor and Sequence Processor) produce a 256-element
vector. Further, both input models use regularization in the form of 50% dropout. This
reduces overfitting to the training dataset, as this model configuration learns very fast.
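To get a feel for the size of this architecture, the trainable parameters of each layer can be counted from the dimensions given above. This is a back-of-the-envelope sketch: the embedding size of 256 and the vocabulary size of 5,000 are assumptions (neither value is stated here), and the function names are hypothetical. Dropout and the addition merge have no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    """Four gate blocks, each with input weights, recurrent weights and biases."""
    return 4 * ((n_in + units) * units + units)

def caption_model_params(vocab_size, feature_dim=4096, hidden=256):
    """Per-layer parameter counts of the merge model described above."""
    return {
        "photo_dense": dense_params(feature_dim, hidden),   # 4,096 -> 256
        "embedding": vocab_size * hidden,                   # assumed size 256
        "lstm": lstm_params(hidden, hidden),                # 256 memory units
        "decoder_dense": dense_params(hidden, hidden),      # 256 -> 256
        "output_dense": dense_params(hidden, vocab_size),   # softmax layer
    }

counts = caption_model_params(vocab_size=5000)  # hypothetical vocabulary size
for name, n in counts.items():
    print(f"{name:>14}: {n:,}")
print(f"{'total':>14}: {sum(counts.values()):,}")
```

The two vocabulary-dependent layers (embedding and output) account for a large share of the total, which is one reason a smaller vocabulary gives a smaller model.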
4.3.3 Fitting the model
Now that we know how to define the model, we can fit it on the training dataset.
The model learns fast and quickly overfits the training dataset. For this reason, we
monitor the skill of the trained model on the holdout development dataset. When the skill
of the model on the development dataset improves at the end of an epoch, we save the
whole model to file. At the end of the run, we can then use the saved model with the best
skill on the development dataset as our final model.
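The save-on-improvement rule can be sketched independently of any framework. The development losses below are made-up numbers for illustration only, and track_best is a hypothetical helper; in Keras, the same behaviour is commonly obtained with the ModelCheckpoint callback and its save_best_only option.

```python
def track_best(dev_losses):
    """Return the best (epoch, loss) and the epochs at which a save would occur."""
    best = None
    saved_epochs = []
    for epoch, loss in enumerate(dev_losses, start=1):
        if best is None or loss < best[1]:
            best = (epoch, loss)
            saved_epochs.append(epoch)  # here the whole model would be written to file
    return best, saved_epochs

# Hypothetical development-set losses over five epochs.
best, saved_epochs = track_best([4.20, 3.90, 3.80, 3.85, 3.95])
print("best epoch and loss:", best)
print("epochs that triggered a save:", saved_epochs)
```

After epoch 3 the development loss rises again while training continues, which is the overfitting pattern described above; the file saved at epoch 3 is the one to keep.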
4.4 Deployment
4.4.1 Integration of our model with a web app using the Flutter framework
4.4.2 Exposing our model to Containerized world through OpenShift
CHAPTER 5
RESULT AND ANALYSIS
Our model outputs captions in the right sequence with an accuracy of 71% for the unbiased
model and 97% after 20 epochs for the biased model.
We have achieved a BLEU score of 0.683 for our model. BLEU (Bilingual Evaluation Understudy) is a
metric for evaluating a generated sentence against a reference sentence. A perfect match results
in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.
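To make the two end points of the scale concrete, the following is a simplified sentence-level BLEU sketch. It is not the evaluation code used for the score reported above: it assumes a single reference, uniform weights over 1-grams to 4-grams, and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, candidate, max_n=4):
    """BLEU of one candidate against one reference (uniform weights, no smoothing)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped overlap: a candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "a dog runs in the field".split()
print(sentence_bleu(reference, "a dog runs in the field".split()))
print(sentence_bleu(reference, "a dog runs in a field".split()))
print(sentence_bleu(reference, "two cats sleep on sofa".split()))
```

The identical caption scores 1.0, the caption with one wrong word scores strictly between 0 and 1, and the caption that shares no word with the reference scores 0.0.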
We tried different models (CNN+RNN, CNN+LSTM, CNN+RNN+LSTM),
of which the CNN+LSTM model gave the best accuracy.
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
6.1. Conclusion of work
We have implemented a deep learning approach for the captioning of images. We used
the TensorFlow framework to implement our deep learning model and achieved a
BLEU score of 0.683. BLEU is a metric for evaluating a generated
sentence against a reference sentence. A perfect match results in a score of 1.0, whereas a perfect
mismatch results in a score of 0.0. We used pre-trained photo models to improve the
feature extraction of the model. We performed extensive text preprocessing and converted the words into
word vectors using the word embedding layer before feeding them to the Long Short-Term Memory (LSTM)
recurrent neural network layer. The configuration of the model was tuned, but alternative
configurations can be trained to look for improvements in the performance of the image
captioning model.
REFERENCES
[2] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, "Show and Tell: Lessons Learned from the
2015 MSCOCO Image Captioning Challenge," in IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 39, no. 4, pp. 652-663, 1 April 2017.
[4] Tanti, Marc, et al. What is the Role of Recurrent Neural Networks (RNNs) in an Image
Caption Generator?, 2017.
[5] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing systems, pp.
1097–1105, 2012.
[6] LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E,
Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code
recognition. Neural computation, 1(4):541–551, 1989.
[7] LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] Steinkraus, Dave, Simard, Patrice Y, and Buck, Ian. Using GPUs for machine learning
algorithms. In Eighth International Conference on Document Analysis and Recognition
(ICDAR'05), pp. 1115–1119. IEEE, 2005.
[9] Karpathy, Andrej and Fei-Fei, Li. Deep visual semantic alignments for generating image
descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3128–3137, 2015.
[10] Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556
ANNEXURES
Product Data Sheets: There are many datasets available for image captioning. The most
commonly used datasets are:
1. COCO - The dataset contains over 330k images, each of which has at least 5 different
caption annotations; the training data alone is 13 GB.
2. VizWiz-Captions - The dataset includes 23,431 training images; 117,155 training captions;
7,750 validation images; 38,750 validation captions; and 8,000 test images.
2. METEOR: METEOR stands for Metric for Evaluation of Translation with Explicit
ORdering. BLEU is computed over the entire generated text, which overshadows the score of
each individual sentence; METEOR addresses this. Instead of using precision and recall
separately, METEOR combines unigram precision and recall into a weighted F-score, and it
applies a penalty function for incorrect word order.
3. ROUGE-L: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. As is clear
from its name, ROUGE is based mainly on recall, but ROUGE-L is based on an F-score, which
is the harmonic mean of its precision and recall values.
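As a worked illustration of the ROUGE-L description, the sketch below computes precision and recall from the longest common subsequence (LCS) of the two sentences, which is the overlap measure ROUGE-L uses, and combines them with the plain harmonic mean described above. It is a simplified sketch with hypothetical function names; the official metric allows a weighting between precision and recall.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """Harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)  # share of the candidate that is matched
    recall = lcs / len(reference)     # share of the reference that is matched
    return 2 * precision * recall / (precision + recall)

reference = "a dog runs in the field".split()
candidate = "a dog runs in a field".split()
print(lcs_length(reference, candidate))
print(rouge_l_f1(reference, candidate))
```

Here the LCS is "a dog runs in field" (5 tokens), so precision and recall are both 5/6 and the F-score is 5/6 as well.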
Codes (Algorithms)
Code snippet to predict the description of an image:
generate_desc: function to generate the caption of an image
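The original listing is not reproduced in this text. The following is a hedged sketch of a greedy generate_desc, not the project's actual code: the trained network is replaced by a hypothetical predict_next callable that returns one probability per vocabulary index, a plain word-to-index dictionary stands in for the tokenizer, and left-padding with zeros is assumed.

```python
def generate_desc(predict_next, photo, word_index, max_length):
    """Greedy decoding: repeatedly append the most probable next word."""
    index_word = {i: w for w, i in word_index.items()}
    words = ["startseq"]
    for _ in range(max_length):
        # Encode the words generated so far and left-pad to a fixed length.
        seq = [word_index[w] for w in words][-max_length:]
        seq = [0] * (max_length - len(seq)) + seq
        probs = predict_next(photo, seq)
        next_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = index_word.get(next_index)
        if word is None:
            break  # index has no word (e.g. padding): stop generating
        words.append(word)
        if word == "endseq":
            break
    return " ".join(words)

# Toy stand-in for a trained model: always continues a fixed word chain.
word_index = {"startseq": 1, "dog": 2, "runs": 3, "endseq": 4}

def toy_predict(photo, seq):
    follow = {1: 2, 2: 3, 3: 4}
    probs = [0.0] * (len(word_index) + 1)
    probs[follow[seq[-1]]] = 1.0
    return probs

caption = generate_desc(toy_predict, photo=None, word_index=word_index, max_length=6)
print(caption)
```

With a real model, predict_next would wrap the model's prediction call on the photo features and the padded sequence; the ‘startseq’ and ‘endseq’ markers are usually stripped before the caption is shown to the user.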
Photo Feature Extraction Code :
Github Link : https://github.com/sarthak-sriw/image_captiongenerator
Colab link : https://github.com/sarthak-sriw/image_captiongenerator/blob/main/Image_caption_generator.ipynb
PROJECT DETAILS
Student Details
Student Name Shubham Chandra Poddar
Register Number 2017UGEC048R Section / Roll No 048
Email Address subh.btech.ec17@iiitranchi.ac.in Phone No (M) 8877042933
Project Details
Project Title Image Caption Generator
Project Duration 5 months Date of reporting 10/05/2021
Internal Guide Details
Faculty Name Dr. Dhananjoy Bhakta
Full contact address with pin code IIIT Ranchi, Namkum, Ranchi, 834010
Email address bhaktadhananjoy@iiitranchi.ac.in Phone No (M) 6290262605
Student Details
Student Name Sarthak Srivastava
Register Number 2017UGEC031R Section / Roll No 031
Email Address sarthak.btech.ec17@iiitranchi.ac.in Phone No (M) 7985260261