Image Caption Generator Report

भारतीय सूचना प्रौद्योगिकी संस्थान राँची

INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, RANCHI
(An Institution of National importance under act of Parliament)
(Ranchi - 834010), Jharkhand
Image Caption Generator
A PROJECT REPORT
Submitted in partial fulfilment of the requirements for the award of the
degree
of
BACHELOR OF TECHNOLOGY (HONS.)
In
(Department of Electronics & Communication Engineering)
Submitted by
(SHUBHAM CHANDRA PODDAR)

(2017UGEC048R)
(SARTHAK SRIVASTAVA)
(2017UGEC031R)
Under the Supervision of
Dr. Dhananjoy Bhakta

(Department of Computer Science & Engineering)
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY, RANCHI – 834010 (JHARKHAND), INDIA
IIIT RANCHI
Date: 10/05/2021
CERTIFICATE
This is to certify that the project titled “Image Caption Generator” is a record of
the bona fide work done by Shubham Chandra Poddar (2017UGEC048R) and
Sarthak Srivastava (2017UGEC031R), submitted in partial fulfilment of the
requirements for the award of the Degree of Bachelor of Technology (Hons.) in the
Department of Electronics and Communication Engineering of the Indian Institute
of Information Technology Ranchi, during the academic year 2020-21.
Dr. Dhananjoy Bhakta

Project Guide, Dept. of CSE
Indian Institute of Information Technology Ranchi
Dr. Shashi Kant Sharma

Faculty In-Charge: Academics
Indian Institute of Information Technology Ranchi
ACKNOWLEDGMENT
Firstly, we would like to thank IIIT Ranchi for giving us the opportunity
to carry out this project. We have learnt a great deal here, and we hope
that everything we have learnt will be helpful in the future.
With great pleasure, we would like to express our deep sense of gratitude
to Dr. Dhananjoy Bhakta (Dept. of Computer Science and Engineering)
for guiding us throughout the journey of our project. We would like to
thank him for his immense patience while teaching us about the
project and for sparing his precious time for us.
We would also like to thank our Faculty In-Charge (Academics), Dr. Shashi
Kant Sharma, for providing facilities and infrastructure, and the other
faculty members for giving us a great opportunity to work on such an
important project.
Regards,
Shubham Chandra Poddar (2017UGEC048R)
Sarthak Srivastava (2017UGEC031R)
ABSTRACT
We have developed an Image Caption Generator. Image captioning is a challenging
artificial intelligence problem in which a text description must be generated
for a given photograph.
It requires both methods from Computer Vision (a CNN) to understand the
content of the image and a language model from the field of Natural
Language Processing to turn that understanding into words in the
right order. The dataset used for this project is the Flickr 8K
dataset.
The dataset consists of two parts: Flicker8K_Dataset (contains
8092 photographs in JPEG format) and Flicker8K_text (contains a number of files
with different sources of descriptions for the photographs). The
dataset has a pre-defined training set (6000 images), validation set (1000
images), and test set (1000 images).
LIST OF FIGURES
Figure No.  Figure Title
1.1  Demo Image
2.1  Generating Sentence for Image
3.1  Describing Image
3.2  Automatic Image Captioning using Recurrent Neural Network
4.1  Architecture of VGG16 model
4.2  LSTM Architecture
4.3  Schematic of Merge Model For Image Captioning
4.4  Plot of the Caption Generation Deep Learning Model
4.5  UI/UX of our app
4.6  Code Structure of our app
4.7  Uploaded Image Prediction
4.8  Output of our model through API
5.1  Input Image
5.2  Output of the above Input Image
LIST OF TABLES
Table No.  Table Title
2.1  Error rates on ImageNet challenge
4.1  Input Output Sequence
5.1  Accuracy of different models
5.2  Accuracy of Different Evaluation Metrics
Contents
Acknowledgement
Abstract
List of Figures
List of Tables
Chapter 1 INTRODUCTION
1.1 Motivation
1.2 Objectives of the project
Chapter 2 LITERATURE SURVEY
2.1 Previous Work
Chapter 3 METHODOLOGY
3.1 Detailed Methodology
Chapter 4 IMPLEMENTATION
4.1 Prepare Photo Data
4.2 Prepare Text Data
4.3 Developing Deep Learning Model
4.4 Deployment
Chapter 5 RESULT ANALYSIS
Chapter 6 CONCLUSIONS & FUTURE SCOPE
6.1 Work Conclusions
6.2 Future Scope of Work
REFERENCES
ANNEXURES
PROJECT DETAILS
CHAPTER 1
INTRODUCTION
1.1. Motivation
Aid to the blind - We can create a product for blind people that guides them while travelling
on the roads without the support of anyone else. This can be done by first converting the scene
into text and then the text into voice. Both are now well-known applications of Deep Learning.
Self-driving cars - Automatic driving is one of the biggest challenges, and if we can
properly caption the scene around the car, it can give a boost to the self-driving system.
1.2. Objectives of the Project
Fig 1.1 Demo Image
What do you see in the image?
Some of us might say “A white dog in a grassy area”, some may say “White dog with
brown spots”, and others “A dog with pink flowers in grass”. All of these captions are
relevant, and there may be more. The point is that a human being can describe an image
just by glancing at it. But can we write a computer program that takes an image as input
and produces a caption for it? With a suitable dataset, this problem can be addressed
using Deep Learning.
After completing this project, we will have learnt:
● How to prepare photo and text data for training a deep learning model.
● How to design and train a deep learning caption generation model.
● How to evaluate a trained caption generation model and use it to caption entirely new
photographs.
CHAPTER 2
LITERATURE SURVEY
2.1 Previous Work
1. State-of-the-art CNN models for image classification are judged by their performance on
the ImageNet challenge. They have progressed incrementally since Krizhevsky et al. (2012)
introduced AlexNet, with modifications to the architecture that have improved
performance. In 2014, Szegedy et al. (2015) introduced GoogLeNet, which improved on
AlexNet mainly by greatly reducing the number of parameters
involved. Also in 2014, Simonyan & Zisserman (2014) introduced VGGNet, which
achieved good performance because of the depth of the network. Most recently, in 2015,
He et al. (2015) introduced ResNet, which utilizes “skip connections” and batch
normalization. The performance of these models on ImageNet is shown below.
Table 2.1 Error rates on ImageNet challenge
2. The image captioning problem has also seen a lot of work in recent years.
Karpathy & Fei-Fei (2015) introduced a method that combines a pre-trained CNN
(VGGNet) as a feature extractor, a Markov Random Field (MRF) model for alignment,
and an RNN for generating text descriptions. The approach of Donahue et al. (2015)
similarly uses a pre-trained VGGNet as a feature extractor, and directly inputs these
feature vectors and the word embedding vectors for sentences at each time step of a Long
Short-Term Memory (LSTM) model, which was introduced by Hochreiter &
Schmidhuber (1997). Vinyals et al. (2015) improved on this approach by
inputting the image vector only at the first time step of the LSTM, which they found to
improve the results.
3. Current image captioning approaches generate descriptions that lack specific
information, such as the named entities involved in the images. Di Lu, Spencer
Whitehead, et al. proposed a new task of generating descriptive image captions that
include such information, given images as input. The simpler solution we propose is to
train a CNN-LSTM model so that it can generate a caption based on the image.
4. A large amount of work has been done on image caption generation tasks. The first
significant work on image captioning was done by Ali Farhadi et al. [1], who defined
three spaces, namely the image space, the meaning space and the sentence space, with
mappings from the image space and the sentence space to the meaning space.
CHAPTER 3
METHODOLOGY
3.1 Detailed Methodology
The task of image captioning can be divided logically into two modules: a CNN and an RNN.
Captioning is about merging the two to combine their most powerful attributes.
A CNN (Convolutional Neural Network) preserves spatial information and recognizes objects in the
image. An RNN (Recurrent Neural Network) works well with any kind of sequential data, such as
generating a sequence of words. By merging the two, we get a model that can find patterns
and features in images and then use that information to generate a description of those images.
There are many open-source datasets available for this problem, such as Flickr 8k (containing 8k
images), Flickr30k (containing 30k images), MSCOCO (containing 180k images), etc. We have
used the Flickr 8K dataset in this project. The dataset consists of two parts: Flicker8K_Dataset
(contains 8092 photographs in JPEG format) and Flicker8K_text (contains a number of files with
different sources of descriptions for the photographs). The dataset has a pre-defined training
set (6000 images), validation set (1000 images), and test set (1000 images).
The image below summarizes the approach given above:
Fig 3.1 Describing Image
Usually, a pre-trained CNN extracts the features from the input image. The feature vector is
linearly transformed to have the same dimension as the input of the RNN/LSTM
network. This network is then trained as a language model conditioned on the feature vector.
For training our LSTM model, we predefine our label and target text. For example, if the caption
is “A man and a girl sit on the ground and eat.”, our label and target would be as follows:
Label – [ <start>, A, man, and, a, girl, sit, on, the, ground, and, eat, . ]
Target – [ A, man, and, a, girl, sit, on, the, ground, and, eat, ., <end> ]
The target is the label shifted by one position, so at each step the model learns to predict the
next word, and the <start> and <end> tokens let it learn where a sequence begins and ends.
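As a minimal sketch of this shift (the helper name make_label_and_target and the whitespace tokenization are illustrative assumptions, not the project's code), the label and target can be built from a single token list:

```python
def make_label_and_target(caption, start="<start>", end="<end>"):
    """Return (label, target): the same tokens, offset by one position."""
    tokens = caption.split()   # assumes punctuation is already space-separated
    label = [start] + tokens   # what the LSTM receives at each step
    target = tokens + [end]    # what the LSTM should predict at each step
    return label, target


label, target = make_label_and_target("A man and a girl sit on the ground and eat .")
for given, expected in zip(label, target):
    print(f"{given:>8} -> {expected}")
```

Each printed pair shows the word given to the network and the word it is expected to produce next.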
Fig 3.2 Automatic Image Captioning using Recurrent Neural Network
Keywords:
Convolutional Neural Network (CNN): A convolutional neural network (CNN) is a neural
network that has one or more convolutional layers and is used mainly for image processing,
classification, segmentation, and other autocorrelated data. A convolution is essentially
sliding a filter over the input.
Embedding Layer: An embedding layer is used as the input layer of the model for text. As a
machine learning model only works with numbers, words are first converted to integer
indices, and the embedding layer then maps each index to a dense vector of fixed size.
Long Short-Term Memory (LSTM): LSTM is a kind of recurrent neural network. In an RNN, the
output from the previous step is fed as input to the current step. LSTM addresses the
long-term dependency problem of plain RNNs, which predict well from recent information but
struggle to use information seen many steps earlier. An LSTM can hold
information for a long period of time. It is used for processing, predicting and classifying
sequence data.
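To illustrate the "sliding a filter over the input" idea from the CNN keyword, here is a small pure-Python sketch. It is not taken from the project; the function name and the toy image are assumptions, and, as in most deep learning libraries, the operation shown is strictly a cross-correlation (the filter is not flipped).

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # multiply the filter with the patch under it and sum the products
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# toy 4x4 "image" with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# 2x2 filter that responds to a dark-to-bright transition from left to right
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The feature map is non-zero only in the column where the filter straddles the edge, which is the sense in which a convolutional layer "detects" a local pattern.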
CHAPTER 4
IMPLEMENTATION
We are going to implement Image Captioning in the following steps:
1. Prepare Photo Data
2. Prepare Text Data
3. Develop Deep Learning Model
4.1 Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos. There are many
models to choose from. In this case, we will use the Oxford Visual Geometry Group, or
VGG, model, which took first place in localization and second place in classification at
the ImageNet competition in 2014. Keras provides this pre-trained
model directly. We could use this model as part of a broader image caption model. The
problem is that it is a large model, and running each photo through the network every time
is redundant. Instead, we can pre-compute the “photo features” using the pre-trained model
and save them to file. We can then load these features later and feed them into our model
as the interpretation of a given photo in the dataset. This optimization makes
training our models faster and reduces memory consumption.
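The pre-compute-and-reuse pattern can be sketched as follows. This is an illustration only: the function names, the pickle file and the stand-in extractor are assumptions, and a real run would call the pre-trained CNN where fake_extractor is used.

```python
import os
import pickle
import tempfile


def extract_features(photo_ids, extract_fn):
    """Run the (expensive) extractor once per photo and collect the results."""
    return {photo_id: extract_fn(photo_id) for photo_id in photo_ids}


def save_features(features, path):
    """Write the whole feature dictionary to disk."""
    with open(path, "wb") as handle:
        pickle.dump(features, handle)


def load_features(path, wanted_ids):
    """Load cached features and keep only the photos of the requested split."""
    with open(path, "rb") as handle:
        all_features = pickle.load(handle)
    return {k: all_features[k] for k in wanted_ids if k in all_features}


# stand-in extractor: a real run would push the photo through the CNN here
calls = []


def fake_extractor(photo_id):
    calls.append(photo_id)
    return [float(len(photo_id))] * 4  # placeholder for a 4,096-element vector


cache_path = os.path.join(tempfile.mkdtemp(), "features.pkl")
save_features(extract_features(["photo_a", "photo_bb"], fake_extractor), cache_path)

# later training runs read the cache instead of re-running the network
train_features = load_features(cache_path, ["photo_bb"])
print(train_features, "extractor calls:", len(calls))
```

The extractor runs once per photo regardless of how many times the cached file is loaded afterwards.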
Fig 4.1 Architecture of VGG16 model
We can load the VGG model in Keras using the VGG16 class. We will remove the last layer
from the loaded model, as this is the layer used to predict a classification for a photo. We
are not interested in classifying images; we are interested in the internal representation
of the photo right before a classification is made. These are the “features” that the model
has extracted from the photo. Keras also provides tools for reshaping the loaded photo into
the preferred size for the model (e.g. a 3-channel 224 x 224 pixel image).
4.2 Prepare Text Data
The dataset contains multiple descriptions for each photograph, and the text of the
descriptions requires some minimal cleaning. Each photo has a unique identifier. This
identifier is used in the photo filename and in the text file of descriptions. We need to map
each photo identifier to a list of one or more textual descriptions.
We will clean the text in the following ways in order to reduce the size of the vocabulary
of words we will need to work with:
● Convert all words to lowercase.
● Remove all punctuation.
● Remove all words that are one character or less in length (e.g. ‘a’).
● Remove all words with numbers in them.
Once cleaned, we can summarize the size of the vocabulary. Ideally, we want a vocabulary
that is both expressive and as small as possible. A smaller vocabulary will result in a
smaller model that will train faster.
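The four cleaning rules and the vocabulary summary can be sketched as below. The function names and the two sample captions are illustrative assumptions; the project's own text-processing code may differ in detail.

```python
import string


def clean_description(text):
    """Apply the four cleaning rules to one caption."""
    strip_punct = str.maketrans("", "", string.punctuation)
    words = text.lower().split()                       # lowercase
    words = [w.translate(strip_punct) for w in words]  # remove punctuation
    words = [w for w in words if len(w) > 1]           # drop one-character words
    words = [w for w in words if w.isalpha()]          # keep purely alphabetic words (drops tokens with digits)
    return " ".join(words)


def build_vocabulary(descriptions):
    """Collect the set of distinct words over all cleaned captions."""
    vocab = set()
    for caption_list in descriptions.values():
        for caption in caption_list:
            vocab.update(caption.split())
    return vocab


raw = {
    "photo_1": [
        "A child in a pink dress is climbing up stairs .",
        "A little girl climbing the stairs to her playhouse .",
    ],
}
cleaned = {k: [clean_description(c) for c in v] for k, v in raw.items()}
vocab = build_vocabulary(cleaned)
print(cleaned["photo_1"][0])
print("vocabulary size:", len(vocab))
```

Note that dropping one-character words also removes the article "a", which is one reason generated captions from such a model tend to lack articles.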
Fig 4.2 LSTM Architecture
4.3 Developing Deep Learning Model

This section is divided into the following parts:
1. Loading Data.
2. Defining the Model.
3. Fitting the Model.
4.3.1 Loading Data
First, we must load the prepared photo and text data so that we can use it to fit the model.
We are going to train the model on all of the photos and captions in the training dataset.
While training, we are going to monitor the performance of the model on the
development/validation set. The train and development datasets have been predefined in the
Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, both of which contain
lists of photo file names. From these file names, we can extract the photo identifiers and
use these identifiers to filter photos and descriptions for each set.
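A minimal sketch of this filtering step is shown below. The helper names and the sample file contents are hypothetical; the only assumption about the split files is the one stated in the text, namely that they list one photo file name per line.

```python
import os


def load_set(split_file_text):
    """Turn the contents of a split file (one file name per line) into photo ids."""
    identifiers = set()
    for line in split_file_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        identifiers.add(os.path.splitext(line)[0])  # drop the file extension
    return identifiers


def filter_by_ids(mapping, identifiers):
    """Keep only the entries (descriptions or features) that belong to the split."""
    return {k: v for k, v in mapping.items() if k in identifiers}


# hypothetical file contents and descriptions, for illustration only
train_listing = "photo_001.jpg\nphoto_002.jpg\n"
all_descriptions = {
    "photo_001": ["dog runs on grass"],
    "photo_002": ["girl climbs stairs"],
    "photo_003": ["man rides bike"],
}
train_ids = load_set(train_listing)
train_descriptions = filter_by_ids(all_descriptions, train_ids)
print(sorted(train_descriptions))
```

The same filter_by_ids call can be applied to the dictionary of pre-computed photo features, so that descriptions and features for a split always share the same keys.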
The model we will develop will generate a caption given a photo, and the caption will be
generated one word at a time. The sequence of previously generated words will be
provided as input. Therefore, we will use ‘startseq’ to kick off the generation process and
‘endseq’ to signal the end of the caption. It is important to add these tokens now, before we
encode the text, so that they are also encoded correctly. Next, we can load the photo features
for a given dataset, and after that we will encode the text.
Each description will be split into words. The model will be provided with one word and the
photo, and will generate the next word. Then the first two words of the description will be
provided to the model as input, together with the image, to generate the next word. This is
how the model will be trained.
For example, the input sequence “little girl running in field” would be split into 6
input-output pairs to train the model:
X1      X2 (text sequence)                               y (word)
photo   startseq                                         little
photo   startseq, little                                 girl
photo   startseq, little, girl                           running
photo   startseq, little, girl, running                  in
photo   startseq, little, girl, running, in              field
photo   startseq, little, girl, running, in, field       endseq
Table 4.1 Input Output Sequence
Later, when the model is used to generate descriptions, the generated words will be
concatenated and recursively provided as input to generate a caption for an image.
The function below named create_sequences(), given the tokenizer, a maximum sequence
length, and the dictionary of all descriptions and photos, will transform the data into
input-output pairs of data for training the model. There are two input arrays to the model:
one for photo features and one for the encoded text. There is one output for the model
which is the encoded next word in the text sequence.
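# Imports assume TensorFlow's bundled Keras; adjust them for a standalone Keras install.
from numpy import array
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, descriptions, photos, vocab_size):
    X1, X2, y = list(), list(), list()
    # walk through every description of every photo
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            # encode the description as a sequence of integer word ids
            seq = tokenizer.texts_to_sequences([desc])[0]
            # split one sequence into multiple (prefix, next word) pairs
            for i in range(1, len(seq)):
                in_seq, out_seq = seq[:i], seq[i]
                # left-pad the input prefix to a fixed length
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                # one-hot encode the output word
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                X1.append(photos[key][0])
                X2.append(in_seq)
                y.append(out_seq)
    return array(X1), array(X2), array(y)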
We now have enough to load the data for the training and development datasets and
transform the loaded data into input-output pairs for fitting a deep learning model.
4.3.2 Defining the model
We will define a deep learning model based on the “merge-model” described by Marc Tanti,
et al. in their 2017 papers [3], [4].
We will describe the model in three parts:
● Photo Feature Extractor. This is a 16-layer VGG model pre-trained on the ImageNet
dataset. We have pre-processed the photos with the VGG model (without the output
layer) and will use the extracted features predicted by this model as input. The Photo
Feature Extractor model expects input photo features to be a vector of 4,096 elements.
These are processed by a Dense layer to produce a 256 element representation of the
photo.
● Sequence Processor. This is a word embedding layer for handling the text input,
followed by a Long Short-Term Memory (LSTM) recurrent neural network layer. The
Sequence Processor model expects input sequences with a predefined length (34
words) which are fed into an Embedding layer that uses a mask to ignore padded
values. This is followed by an LSTM layer with 256 memory units.
● Decoder. Both the feature extractor and sequence processor output a fixed-length
vector. These are merged together and processed by a Dense layer to make a final
prediction. The Decoder model merges the vectors from both input models using an
addition operation. This is then fed to a Dense 256 neuron layer and then to a final
output Dense layer that makes a softmax prediction over the entire output vocabulary
for the next word in the sequence.
Both the input models (Photo Feature Extractor and Sequence processor) produce a 256
element vector. Further, both input models use regularization in the form of 50% dropout. This
is to reduce overfitting the training dataset, as this model configuration learns very fast.
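One way to write the forward pass of the three parts described above is sketched below. The symbols are our own notation, and the nonlinearity φ and the exact placement of the dropout layers are assumptions, since the text only states the layer sizes, the 50% dropout rate and the addition merge.

```latex
\begin{aligned}
f &= \phi\big(W_f\,\mathrm{Dropout}_{0.5}(x) + b_f\big),
  & x \in \mathbb{R}^{4096},\; f \in \mathbb{R}^{256} \\
h &= \mathrm{LSTM}\big(\mathrm{Dropout}_{0.5}(\mathrm{Embed}(w_1, \dots, w_T))\big),
  & T = 34,\; h \in \mathbb{R}^{256} \\
z &= \phi\big(W_z\,(f + h) + b_z\big),
  & z \in \mathbb{R}^{256} \\
P(w_{T+1} \mid x, w_{1:T}) &= \mathrm{softmax}\big(W_o\,z + b_o\big),
  & W_o \in \mathbb{R}^{V \times 256}
\end{aligned}
```

Here x is the pre-computed photo feature vector, w_1, ..., w_T are the padded input words, and V is the vocabulary size. The addition f + h is only possible because both input models produce 256-element vectors.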
4.3.3 Fitting the model
Now that we know how to define the model, we can fit it on the training dataset.
The model learns fast and quickly overfits the training dataset. For this reason, we will
monitor the skill of the trained model on the holdout development dataset. When the skill
of the model on the development dataset improves at the end of an epoch, we will save the
whole model to file. At the end of the run, we can then use the saved model with the best
skill on the development dataset as our final model.
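The save-on-improvement rule can be sketched independently of any framework. In the sketch below, the function name, the callables passed in and the loss values are all made up for illustration, and "skill" is assumed to mean loss on the development set (lower is better).

```python
def fit_with_checkpointing(epochs, train_one_epoch, evaluate_dev, save_model):
    """Train for `epochs` epochs, saving whenever the development loss improves."""
    best_loss = float("inf")
    best_epoch = None
    for epoch in range(1, epochs + 1):
        train_one_epoch(epoch)
        dev_loss = evaluate_dev(epoch)
        if dev_loss < best_loss:         # "skill improves" = lower dev loss here
            best_loss, best_epoch = dev_loss, epoch
            save_model(epoch, dev_loss)  # overwrite or version the checkpoint
    return best_epoch, best_loss


# made-up development losses that improve and then degrade (overfitting)
fake_dev_losses = {1: 4.2, 2: 3.9, 3: 3.8, 4: 4.0, 5: 4.3}
saved = []
best_epoch, best_loss = fit_with_checkpointing(
    epochs=5,
    train_one_epoch=lambda epoch: None,  # stand-in for a real training pass
    evaluate_dev=lambda epoch: fake_dev_losses[epoch],
    save_model=lambda epoch, loss: saved.append(epoch),
)
print("checkpoints written at epochs", saved, "best epoch:", best_epoch)
```

In Keras, the ModelCheckpoint callback with monitor='val_loss' and save_best_only=True provides this behaviour without a hand-written loop.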
4.4 Deployment
4.4.1 Integration of our model with a web app using the Flutter framework
The app accepts an image in three ways:
● Live camera feed
● Feed image through camera
● Feed image through gallery
Fig 4.5 UI/UX of our app
Fig 4.6 Code Structure of our app
Fig 4.7 Uploaded Image Prediction
4.4.2 Exposing our model to the containerized world through OpenShift
Fig 4.8 Output of our model through OpenShift API
CHAPTER 5
RESULT AND ANALYSIS
Our model outputs captions in the right word order, with an accuracy of 71% for the unbiased
model and 97% for the biased model after 20 epochs.
We achieved a BLEU score of 0.683 for our model. BLEU is a metric for comparing a
generated sentence to a reference sentence. A perfect match results in a score of 1.0, whereas a
perfect mismatch results in a score of 0.0.
Fig 5.1 Input Image
Fig 5.2 Output of the above Input Image
We experimented with different models (CNN+RNN, CNN+LSTM, CNN+RNN+LSTM),
of which the CNN+LSTM model gave the best accuracy.
Table 5.1 Accuracy of different models
Table 5.2 Accuracy of Different Evaluation Metrics
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
6.1. Conclusion of work
We have implemented a deep learning approach for the captioning of images. We used
the TensorFlow framework to implement our deep learning model and achieved a
BLEU score of 0.683. BLEU is a metric for comparing a generated
sentence to a reference sentence. A perfect match results in a score of 1.0, whereas a perfect
mismatch results in a score of 0.0. We used pre-trained photo models to improve the
feature extraction of the model. The text was cleaned and converted into
word vectors by a word embedding layer before being fed to the Long Short-Term Memory (LSTM)
recurrent neural network layer. The configuration of the model was tuned, but alternative
configurations could be trained to see whether they improve the performance of the image
captioning model.
6.2. Future scope
For future work, we propose the following possible improvements:
1. As our project can generate text from images, we can improve the accuracy of our
model by making it domain specific. Examples of domain-specific image applications:
   a. SkinVision: lets you check whether a skin condition may be skin cancer.
   b. Google Photos: classifies your photos into categories such as mountains, sea, etc.
   c. FedEx and other courier services: use handwritten digit recognition systems to
   detect PIN codes correctly.
2. A CNN does not take into account the orientation of features or the spatial relationships
between them, so future work could address this limitation.
3. Advanced processing to make the caption semantics as clear as possible and consistent with
the given image content.
REFERENCES
Journal / Conference Papers

[1] Farhadi, A. et al. Every Picture Tells a Story: Generating Sentences from Images. In:
Daniilidis K., Maragos P., Paragios N. (eds) Computer Vision – ECCV 2010. Lecture
Notes in Computer Science, vol 6314. Springer, Berlin, Heidelberg, 2010.
[2] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and Tell: Lessons Learned from the
2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 39(4):652–663, 2017.
[3] Tanti, Marc, et al. Where to put the Image in an Image Caption Generator, 2017.
[4] Tanti, Marc, et al. What is the Role of Recurrent Neural Networks (RNNs) in an Image
Caption Generator?, 2017.
[5] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing systems, pp.
1097–1105, 2012.
[6] LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E,
Hubbard, Wayne, and Jackel, Lawrence D. Backpropagation applied to handwritten zip code
recognition. Neural computation, 1(4):541–551, 1989.
[7] LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[8] Steinkraus, Dave, Simard, Patrice Y, and Buck, Ian. Using GPUs for machine learning
algorithms. In Proceedings of the Eighth International Conference on Document Analysis and
Recognition (ICDAR), pp. 1115–1119. IEEE, 2005.
[9] Karpathy, Andrej and Fei-Fei, Li. Deep visual semantic alignments for generating image
descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3128–3137, 2015.
[10] Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556
ANNEXURES
Product Data sheets: There are many datasets available for image captioning. The most
commonly used datasets are:
1. COCO: The dataset contains over 330k images, each of which has at least 5 different
caption annotations; the training data alone is about 13 GB.
2. VizWiz-Captions: The dataset includes 23,431 training images; 117,155 training captions;
7,750 validation images; 38,750 validation captions; and 8,000 test images.
3. InstaPic-1, Visual Genome, MSVD, etc.
Datasets used in different case studies
Evaluation Mechanism: Evaluating the trained model is a difficult task in
image captioning, and various evaluation metrics have been created for this purpose. The most
common evaluation mechanisms found in the literature are BLEU, ROUGE-L, CIDEr, METEOR, and
SPICE. BLEU is the most popular method of evaluation, used by almost all of the
studies.
1. BLEU: BLEU stands for Bilingual Evaluation Understudy. It is an evaluation mechanism
widely used in text generation. It compares machine-generated text
with one or more manually written texts, summarizing how close a
generated text is to an expected text. The score lies between 0.0 and 1.0, where 1.0 is
a perfect score and 0.0 is the worst score. We found that almost all studies used BLEU as their
evaluation metric and calculated BLEU-1 to BLEU-4, where BLEU-1 is computed
on 1-grams, BLEU-2 on 2-grams, BLEU-3 on 3-grams and BLEU-4 on 4-grams.
2. METEOR: METEOR stands for Metric for Evaluation of Translation with Explicit
Ordering. Whereas BLEU is computed over the entire generated text, which can overshadow the
score of each individual sentence, METEOR works at the sentence level. Instead of using
precision and recall separately, METEOR uses a weighted F-score for unigram matching and a
penalty function for incorrect word order.
3. ROUGE-L: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. As its
name suggests, ROUGE is based on recall, but ROUGE-L is based on an F-score, which
is the harmonic mean of its precision and recall values.
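To make the BLEU item above concrete, here is a small self-contained sketch of sentence-level BLEU. It is not the evaluation code used in the project: it assumes the common cumulative convention (BLEU-n is the geometric mean of the clipped 1-gram to n-gram precisions, multiplied by a brevity penalty), uses no smoothing, and the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it appears in the single reference that contains it most often."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Sentence-level cumulative BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


references = [
    "a white dog runs through the grass".split(),
    "a dog is running in a grassy area".split(),
]
candidate = "a white dog runs in the grass".split()
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(candidate, references, max_n=n):.3f}")
```

The score drops as n grows because longer word sequences are harder to match exactly, which is why BLEU-4 is usually the strictest of the four reported values.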
Codes (Algorithms)
Code snippet to predict the description of an image:
● generate_desc: function to generate the caption of an image
● tokenizer: encodes the input text
● word_for_id: generates the word for a given id
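Since the original snippet is not reproduced here, the following is a hedged reconstruction of how such functions typically fit together, not the project's actual code. The names generate_desc and word_for_id come from the list above; predict_next_id is a hypothetical stand-in for padding the sequence, calling the trained model and taking the most probable word, and a plain dictionary replaces the tokenizer.

```python
def word_for_id(integer, word_index):
    """Map an integer id back to its word; returns None for unknown ids."""
    for word, index in word_index.items():
        if index == integer:
            return word
    return None


def generate_desc(predict_next_id, word_index, photo, max_length):
    """Greedy decoding: repeatedly ask the model for the next word id."""
    in_text = "startseq"
    for _ in range(max_length):
        # encode the words generated so far as integer ids
        sequence = [word_index[w] for w in in_text.split() if w in word_index]
        next_id = predict_next_id(photo, sequence)  # stand-in for model prediction + argmax
        word = word_for_id(next_id, word_index)
        if word is None:
            break
        in_text += " " + word
        if word == "endseq":
            break
    return in_text


# toy vocabulary and a fake "model" that replays a fixed caption, for illustration
word_index = {"startseq": 1, "dog": 2, "runs": 3, "endseq": 4}
scripted = {(1,): 2, (1, 2): 3, (1, 2, 3): 4}


def fake_predict(photo, sequence):
    return scripted[tuple(sequence)]


caption = generate_desc(fake_predict, word_index, photo=None, max_length=34)
print(caption)
```

The loop stops either when the model emits endseq or when the maximum caption length is reached, so generation always terminates.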
Photo Feature Extraction Code :
Text Processing Code
GitHub link: https://github.com/sarthak-sriw/image_captiongenerator
Colab link: https://github.com/sarthak-sriw/image_captiongenerator/blob/main/Image_caption_generator.ipynb
PROJECT DETAILS
Student Details
Student Name: Shubham Chandra Poddar
Register Number: 2017UGEC048R
Section / Roll No: 048
Email Address: subh.btech.ec17@iiitranchi.ac.in
Phone No (M): 8877042933
Project Details
Project Title: Image Caption Generator
Project Duration: 5 months
Date of reporting: 10/05/2021
Internal Guide Details
Faculty Name: Dr. Dhananjoy Bhakta
Full contact address with pin code: IIIT Ranchi, Namkum, Ranchi, 834010
Email address: bhaktadhananjoy@iiitranchi.ac.in
Phone No (M): 6290262605

Student Details
Student Name: Sarthak Srivastava
Register Number: 2017UGEC031R
Section / Roll No: 031
Email Address: sarthak.btech.ec17@iiitranchi.ac.in
Phone No (M): 7985260261
Project Details
Project Title: Image Caption Generator
Project Duration: 5 months
Date of reporting: 10/05/2021
Internal Guide Details
Faculty Name: Dr. Dhananjoy Bhakta
Full contact address with pin code: IIIT Ranchi, Namkum, Ranchi, 834010
Email address: bhaktadhananjoy@iiitranchi.ac.in
Phone No (M): 6290262605
