IMAGE CAPTIONING
MINI PROJECT REPORT
Submitted by
MAGESHKUMAR S (212221060155)
MOKESH M (212221060176)
SANJAY K (212221060240)
in partial fulfillment for the award of the
degree of
BACHELOR OF ENGINEERING
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
SAVEETHA ENGINEERING COLLEGE
(AUTONOMOUS)
AFFILIATED TO ANNA UNIVERSITY THANDALAM,
CHENNAI-600 025
JUNE 2023
SAVEETHA ENGINEERING COLLEGE (AUTONOMOUS),
CHENNAI
ANNA UNIVERSITY, CHENNAI – 600 025
BONAFIDE CERTIFICATE
Certified that this project report “IMAGE CAPTIONING” is the bonafide
work of “MAGESH KUMAR S (212221060155), MOKESH M
(212221060176), SANJAY K (212221060240)”, who carried out the project
work under my/our supervision.
SIGNATURE
Dr. C. Sheeba Joice, M.E., M.B.A., Ph.D.
HEAD OF THE DEPARTMENT
ELECTRONICS AND COMMUNICATION ENGINEERING

SIGNATURE
Dr. N. Sugitha, M.E., Ph.D.
SUPERVISOR
Associate Professor
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted for the project viva-voce examination held on _____________________
INTERNAL EXAMINER EXTERNAL EXAMINER
MINI PROJECT APPROVAL SHEET
The mini project report “IMAGE CAPTIONING” submitted by
‘MOKESH M (212221060176), SANJAY K (212221060240),
MAGESH KUMAR S (212221060155)’ is approved for submission
as a partial requirement for the award of the Degree of Bachelor of
Engineering in Electronics and Communication Engineering, Anna
University, during the academic year 2022-2023.
Submitted for the University Project Viva Voce examination held on
_____________________.
INTERNAL EXAMINER EXTERNAL EXAMINER
ABSTRACT
The aim of image captioning is to automatically create a description
of an image in a natural language. With the development of deep
learning, it has become possible to use it to understand image content
and generate descriptive text. The objective of our project is to learn
the concepts of the CNN and LSTM models and build a working image
caption generator by implementing a CNN together with an LSTM. A
convolutional neural network (CNN) is employed as the encoding layer
to extract image features, a long short-term memory (LSTM) network
with attention is used to decode the multilayer dense attention model,
and the description text is generated. The image is converted into a
multi-feature dataset characterizing its distinctive features. The analysis
is carried out on the popular Flickr8K dataset. The experimental results
on general images validate the model’s ability to understand images
and generate text. Finally, this report highlights some open challenges
in the image captioning task.
ACKNOWLEDGMENT
We convey our sincere thanks to Dr. N. M. Veeraiyan, President (SMET)
and Chancellor, SIMATS, Saveetha Amaravathi University, Dr. S. Rajesh,
Director, Saveetha Engineering College, and Dr. V. Saveetha Rajesh,
Director, Saveetha Medical College and Hospital, for providing us with the
facilities for the completion of our project. We are grateful to our Principal,
Dr. N. Duraipandian, M.E., Ph.D., for his continuous support and
encouragement in carrying out our project work. We are deeply indebted to
our beloved Head of the Department, Dr. Sheeba Joice, M.E., M.B.A., Ph.D.,
Department of Electronics and Communication, for giving us the
opportunity to display our professional skills through this project.
We are greatly thankful to our Project Coordinator, Dr. S. Asha, M.Tech.,
Ph.D., and our Project Guide, Dr. M. Vanitha, M.E., Ph.D., for their
valuable guidance and motivation, which helped us to complete our project
on time.
We thank all our teaching and non-teaching faculty members of the
Department of Electronics and Communication for their passionate
support, for helping us to identify our mistakes, and for the
appreciation they gave us. We heartily thank our library staff and the
management for their extensive support in providing the resources and
information that helped us to complete the project successfully. Also, we
would like to record our deepest gratitude to our parents for their constant
encouragement and support, which motivated us a lot to complete our
project work.
TABLE OF CONTENTS
CHAPTER TITLE PAGE
NO. NO.
ABSTRACT iv
LIST OF FIGURES viii
LIST OF ABBREVIATIONS x
1. INTRODUCTION 1
1.1 OVERVIEW 1
1.2 LSTM 1
1.3 CNN MODEL 2
1.3.1 ARCHITECTURE OF CNN 3
1.4 RNN MODEL 3
1.5 ARCHITECTURE OF LSTM 4
1.6 BACKEND 4
2. LITERATURE SURVEY 5
3. REQUIREMENT ANALYSIS 7
3.1 HARDWARE REQUIREMENT 7
3.1.1 GPU 7
3.1.2 TPU 8
3.2 SOFTWARE REQUIREMENT 8
3.2.1 TENSORFLOW 8
3.2.2 KERAS 8
4. SYSTEM ANALYSIS AND DESIGN 9
4.1 EXISTING SYSTEM 9
4.2 DRAWBACKS OF CAPTIONBOT 9
4.3 PROPOSED SYSTEM 9
4.4 UML DIAGRAMS 10
4.4.1 USE CASE DIAGRAM 10
4.4.2 ACTIVITY DIAGRAM 11
4.4.3 ARCHITECTURE FOR IMAGE
CAPTIONING 12
5 SYSTEM IMPLEMENTATION 13
5.1 IMPORTING THE PACKAGES 13
5.2 GETTING AND PERFORMING 13
DATA CLEANING
5.2.1 LOAD DOC 13
5.2.2 ALL IMG CAPTION 13
5.2.3 CLEANING TEXT 14
5.2.4 TEXT VOCABULARY 14
5.2.5 SAVE DESCRIPTION 14
5.3 EXTRACTING FEATURE 15
VECTOR FROM ALL IMAGES
5.4 LOADING DATASET FOR 15
TRAINING THE MODEL
5.5 TOKENIZING THE VOCABULARY 15
5.6 CREATE DATA GENERATOR 16
6 PERFORMANCE EVALUATION 17
6.1 TESTING THE MODEL 17
6.2 RESULT 17
7 CONCLUSION AND FUTURE
ENHANCEMENT 18
7.1 CONCLUSION 18
7.2 FUTURE ENHANCEMENT 18
APPENDIX PAGE NO
APPENDIX 1 SOURCE CODE 19
APPENDIX 2 SCREENSHOT 20
LIST OF FIGURES
FIGURE NO. NAME PAGE NO.
1.3.1 ARCHITECTURE OF CNN 3
1.5 ARCHITECTURE OF LSTM 4
4.4.1 USE CASE DIAGRAM 10
4.4.2 ACTIVITY DIAGRAM 11
4.4.3 ARCHITECTURE DIAGRAM FOR IMAGE 12
CAPTIONING
6.2 RESULT 17
LIST OF ABBREVIATIONS
CNN – Convolutional Neural Network
RNN – Recurrent Neural Network
LSTM – Long Short Term Memory
MLP - Multilayer Perceptron
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Image captioning, the automatic generation of a natural-language
description of the content observed in an image, is an important part
of scene understanding, which combines the knowledge of computer
vision and natural language processing.
1.2 LSTM
LSTM stands for Long Short-Term Memory. LSTMs are a type of RNN
that is well suited to sequence prediction problems: based on the
previous text, they can predict what the next word will be. An LSTM can
carry relevant information throughout the processing of its inputs
and, with a forget gate, it discards non-relevant information. We use an
LSTM with soft attention as the decoder, which selectively focuses
attention on certain parts of the image to predict the next words. The
LSTM generates the description from the input image by processing the
already extracted feature maps; it acts as the language model of the
decoder that decodes the feature vector into a sentence.
ResNet50 provides a reliable initialization for object recognition and
allows the training time to be reduced. For any image from the training
set, we take the output vector representation from the last convolutional
layer. This vector is fed to the LSTM input. Because the length of the
description may differ, the model should know where to start and
stop. To do this, we add two tokens, <START> and <END>, which
mark the beginning and end of each caption. The network for generating
the captions has to capture the words between these tokens. The LSTM
model learns to predict the next word S_t in the caption based on the
vector of visual features and the previous t-1 words. The LSTM provides
the next state vector h_t and the next word. The context vector z_t is a
concatenation of the feature vector and the one-hot vector of the word
representation.
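As a concrete illustration, the snippet below is a minimal Keras sketch of
such a CNN-feature plus LSTM decoder (a simple merge architecture,
without the attention mechanism). The 2048-dimensional feature size
matches the pooled CNN features used later in this report; vocab_size and
max_length are placeholders determined by the dataset.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Image branch: pooled 2048-dim CNN feature vector -> 256-dim representation
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Text branch: partial caption -> word embeddings -> LSTM state
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Merge both branches and predict the next word of the caption
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model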
1.3 CNN MODEL
Convolutional Neural Networks are specialized deep neural networks that can
process data with a 2D, matrix-like input shape. Images are easily represented
as a 2D matrix, so CNNs are very useful when working with images. A CNN is
basically used for image classification, for example identifying whether an
image shows a bird, a plane or Superman. Convolutional Neural Network (CNN)
layers are used for feature extraction on the input data.
A CNN applies learned filters that slide over the input. Each filter detects
features in the given data and generates a feature map. An advantage of the
CNN model is that it is location invariant, i.e. it can detect a feature even
if it appears at a different location in the input. The network scans
images from left to right and top to bottom to pull out important
features from the image and combines these features to classify images.
1.3.1 ARCHITECTURE OF CNN
1.4 RNN MODEL
RNNs are used in deep learning and in the development of models
that simulate the activity of neurons in the human brain. They are
especially powerful in use cases where context is critical to
predicting an outcome. RNNs use feedback loops to process a
sequence of data that informs the final output, which can also be a
sequence of data. These feedback loops allow information to
persist; the effect is often described as memory.
1.5 ARCHITECTURE OF LSTM
1.6 BACKEND
For the image caption generator, we will be using the Flickr8K image
and Flickr8K text datasets. Flickr8K contains 8091 images, and the text
dataset contains text files with the captions of the images.
CHAPTER 2
LITERATURE SURVEY
2.1 LITERATURE SURVEY
Image captioning requires recognizing the important objects, their
attributes, and their relationships in an image. It also needs to
generate syntactically and semantically correct sentences. Deep-
learning-based techniques are capable of handling the complexity
and challenges of image captioning. In this survey, we aim to present
a comprehensive review of existing deep-learning-based image
captioning techniques. Describing the content of an image is a
fundamental problem in artificial intelligence that connects
computer vision and natural language processing.
Convolutional Neural Networks (CNNs) are biologically inspired
variants of Multilayer Perceptrons. A CNN comprises one or
more convolution layers (often with a subsampling step)
followed by one or more fully connected layers, as in a standard
multilayer neural network. The architecture of a CNN is designed
to take advantage of the 2D structure of an input image (or other 2D
input such as a speech signal). This is achieved with local
connections and tied weights followed by some form of pooling,
which results in translation-invariant features. Another benefit of
CNNs is that they are easier to train and have many fewer
parameters than fully connected networks with the same number of
hidden units. CNNs have been widely used and studied for image
tasks, and are currently state of the art for object recognition and
detection.
Recurrent Neural Networks (RNNs) are models that have shown
great promise in many NLP tasks. The concept of RNNs is to make
use of sequential information. In a traditional neural network we
assume that all inputs (and outputs) are independent of each other,
but for many tasks this is not effective. If you want to predict the
next word in a sentence, you have to know which words came before
it. RNNs are called recurrent because they perform the same task
for every element of a sequence, with the output depending on
the previous computations. Alternatively, RNNs can be thought of as
networks that have a “memory” which captures information about
what has been computed so far.
Our approach is to infer these alignments and use them to learn a generative
model of descriptions. We develop a deep neural network model that infers the
alignment between segments of sentences and the regions of the image that they
describe. We introduce a Recurrent Neural Network architecture that takes
an input image and generates its description in text. Our experiments show that
the generated sentences produce sensible qualitative predictions.
CHAPTER 3
REQUIREMENT ANALYSIS
3.1 HARDWARE REQUIREMENT
The science and methodology behind deep learning have been in
existence for decades. In recent years, however, there has been a
significant acceleration in the utilization of deep learning due to the
increasing abundance of digital data and the availability of
powerful hardware.
3.1.1 GPU
A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to
rapidly manipulate and alter memory to accelerate the creation of images in a
frame buffer intended for output to a display device. GPUs are used in embedded
systems, mobile phones, personal computers, workstations and game consoles.
Compared to a CPU, the performance of matrix multiplication on a Graphics
Processing Unit is significantly better. With GPU computing resources, all the
deep learning tools mentioned achieve a much higher speedup compared to
their CPU-only versions. GPUs have become the platform of choice for training
large, complex neural-network-based systems because of their ability to
accelerate such systems. There is no single software tool that can consistently
outperform the others, however, which implies that there are still opportunities
to further optimize performance.
3.1.2 TPU
A Tensor Processing Unit (TPU) is an AI accelerator application-specific
integrated circuit developed by Google specifically for neural network machine
learning. The goal is to run whole inference models on the TPU to reduce I/O
between the TPU and the host CPU.
3.2 SOFTWARE REQUIREMENT
3.2.1 TENSORFLOW
TensorFlow is a free and open-source software library for machine
learning. It is a symbolic math library based on dataflow and
differentiable programming. TensorFlow is designed for remarkable
flexibility, portability, and efficient use of the available hardware.
3.2.2 KERAS
Keras is an open-source software library that provides a Python interface for
artificial neural networks. It acts as an interface for the TensorFlow library. Keras
allows for easy and fast prototyping and runs seamlessly on CPU and GPU. It
was developed with a focus on enabling fast experimentation. The
ImageDataGenerator class provided by the Keras API is essentially an
implementation of a generator function in Python.
CHAPTER 4
SYSTEM ANALYSIS AND DESIGN
4.1 EXISTING SYSTEM
Microsoft has published an image captioning experiment on the
internet, “CaptionBot”. The idea is that you upload a photo to the
service, and it tries to automatically generate a caption. It accurately
detects what is in the picture. The Computer Vision API
identifies the components of the photo, mixes that with data from
the Bing Image Search API, and runs any faces it spots through the
Emotion API. This analyses human facial expressions to detect
anger, contempt, fear, happiness, sadness or surprise.
4.2 DRAWBACKS OF CAPTIONBOT
Sometimes the output is not accurate, and it detects unwanted features.
It covers only a small percentage of visual concepts, and sometimes it
fails to describe the image.
4.3 PROPOSED SYSTEM
Image caption generation is a task that involves computer vision and
natural language processing concepts to recognize the context of an
image and describe it in a natural language such as English. We learn
the concepts of the CNN and LSTM models and build a working image
caption generator by implementing a CNN together with an LSTM.
4.4 UML DIAGRAMS
4.4.1 USE CASE DIAGRAM
4.4.2 ACTIVITY DIAGRAM
4.4.3 ARCHITECTURE DIAGRAM FOR IMAGE
CAPTIONING
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 IMPORTING THE PACKAGES
Here we need to import all the necessary packages: numpy, string,
PIL's Image, pickle's dump, the Xception model, os, and Keras layers
such as Embedding and LSTM. A possible set of imports is sketched below.
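This is a minimal sketch of the imports, assuming a TensorFlow 2.x
installation where Keras is available under tensorflow.keras:

import os
import string
import numpy as np
from pickle import dump, load
from PIL import Image
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model, load_model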
5.2 GETTING AND PERFORMING DATA CLEANING
The main text file that contains all image captions is
Flickr8k.token in our Flickr_8k_text folder. Each line of the
file contains an image name and a caption separated by a tab, and
each image has 5 captions, numbered #0 to #4. Here we will define
5 functions: load_doc, all_img_captions, cleaning_text,
text_vocabulary and save_descriptions.
5.2.1 LOAD DOC
This function loads the document file and reads the contents of the file
into a string.
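A minimal sketch of this helper:

def load_doc(filename):
    # Open the file and return its whole contents as a single string
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()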
5.2.2 ALL IMG CAPTION
This function will create a descriptions dictionary that maps each image
to its list of 5 captions.
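A sketch of this function, assuming the tab-separated Flickr8k.token
format described above and the load_doc helper:

def all_img_captions(filename):
    # Map each image name to the list of its captions
    descriptions = {}
    for line in load_doc(filename).split('\n'):
        tokens = line.split('\t')
        if len(tokens) < 2:
            continue
        image_id, caption = tokens[0], tokens[1]
        image_name = image_id.split('#')[0]   # drop the #0..#4 caption index
        descriptions.setdefault(image_name, []).append(caption)
    return descriptions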
5.2.3 CLEANING TEXT
This function takes all the descriptions and performs data cleaning. This
is an important step when we work with textual data; according to
our goal, we decide what type of cleaning we want to perform on
the text. In our case, we will be removing punctuation, converting
all text to lowercase and removing words that contain numbers.
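A sketch of the cleaning step (single-character tokens are also dropped,
which is a common extra assumption in this pipeline; it relies on the
string import from Section 5.1):

def cleaning_text(descriptions):
    # Lowercase, strip punctuation and drop tokens with digits or single characters
    table = str.maketrans('', '', string.punctuation)
    for image, captions in descriptions.items():
        for i, caption in enumerate(captions):
            words = [w.lower().translate(table) for w in caption.split()]
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            captions[i] = ' '.join(words)
    return descriptions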
5.2.4 TEXT VOCABULARY
This is a simple function that will separate all the unique words and
create the vocabulary from all the descriptions.
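A sketch of the vocabulary builder:

def text_vocabulary(descriptions):
    # Collect every unique word used across all cleaned captions
    vocab = set()
    for captions in descriptions.values():
        for caption in captions:
            vocab.update(caption.split())
    return vocab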
5.2.5 SAVE DESCRIPTION
This function will create a list of all the descriptions that have been
pre-processed and store them in a file. We will create a
descriptions.txt file to store all the captions.
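A sketch of saving the cleaned captions:

def save_descriptions(descriptions, filename):
    # Write one "image<TAB>caption" line per caption to descriptions.txt
    lines = []
    for image, captions in descriptions.items():
        for caption in captions:
            lines.append(image + '\t' + caption)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))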
5.3 EXTRACTING THE FEATURE VECTOR FROM ALL
IMAGES
This technique is also called transfer learning: we don't have to do
everything on our own; we use a pre-trained model that has already
been trained on a large dataset, extract the features from this model
and use them for our task. We are using the Xception model, which
has been trained on the ImageNet dataset with 1000 different classes
to classify. We can directly import this model from keras.applications.
Make sure you are connected to the internet, as the weights are
downloaded automatically. Since the Xception model was originally
built for ImageNet, we make small changes to integrate it with our
model. One thing to notice is that the Xception model takes a
299*299*3 image as input. We will remove the last classification layer
and get the 2048-dimensional feature vector:
model = Xception(include_top=False, pooling='avg'). The function
extract_features() will extract features for all images, and we will map
image names to their respective feature arrays. Then we will dump the
features dictionary into a “features.p” pickle file.
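A sketch of the feature extraction step, using the imports from Section
5.1; the image directory path is a placeholder for wherever the Flickr8K
images are stored:

def extract_features(directory):
    # Xception without its classifier head; global average pooling gives a 2048-dim vector
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for name in os.listdir(directory):
        image = Image.open(os.path.join(directory, name)).convert('RGB')
        image = image.resize((299, 299))                    # Xception expects 299x299 input
        image = np.expand_dims(np.array(image, dtype='float32'), axis=0)
        image = preprocess_input(image)                     # scale pixels to [-1, 1]
        features[name] = model.predict(image, verbose=0)
    return features

# features = extract_features('Flicker8k_Dataset')   # hypothetical folder name
# dump(features, open('features.p', 'wb'))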
5.4 LOADING DATASET FOR TRAINING THE MODEL
In the Flickr_8k_text folder, we have the Flickr8k.trainImages.txt file,
which contains a list of 6000 image names that we will use for training.
For loading the training dataset, we need a few more functions
(sketched after this list):
• Load photos – This will load the text file into a string and will
return the list of image names.
• Load clean descriptions – This function will create a dictionary
that contains the captions for each photo from the list of photos. We
also append the <start> and <end> identifiers to each caption.
We need this so that our LSTM model can identify the start
and end of a caption.
• Load features – This function will give us the dictionary of
image names and their feature vectors, which we previously
extracted with the Xception model.
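A sketch of these three helpers, assuming the file names used earlier in
this chapter:

def load_photos(filename):
    # Return the list of image names in the chosen split
    return [line for line in load_doc(filename).split('\n') if line.strip()]

def load_clean_descriptions(filename, photos):
    # Rebuild the captions dictionary, wrapping each caption with <start>/<end>
    descriptions = {}
    for line in load_doc(filename).split('\n'):
        words = line.split()
        if len(words) < 2:
            continue
        image, caption = words[0], words[1:]
        if image in photos:
            descriptions.setdefault(image, []).append(
                '<start> ' + ' '.join(caption) + ' <end>')
    return descriptions

def load_features(photos):
    # Keep only the feature vectors of the selected images
    all_features = load(open('features.p', 'rb'))
    return {k: all_features[k] for k in photos}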
5.5 TOKENIZING THE VOCABULARY
Computers do not understand English words, so we have to represent
them with numbers. We will therefore map each word of the
vocabulary to a unique index value. The Keras library provides us
with a Tokenizer class that we will use to create tokens from
our vocabulary and save them to a “tokenizer.p” pickle file.
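A sketch of the tokenization step (note that the default Keras Tokenizer
filters strip the angle brackets, so <start> and <end> become the tokens
start and end):

def create_tokenizer(descriptions):
    # Fit a Tokenizer on every caption so each word gets a unique integer index
    lines = [cap for caps in descriptions.values() for cap in caps]
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# tokenizer = create_tokenizer(train_descriptions)
# dump(tokenizer, open('tokenizer.p', 'wb'))
# vocab_size = len(tokenizer.word_index) + 1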
5.6 CREATE DATA GENERATOR
Here we need to train our model on 6000 images; each image is
represented by a 2048-length feature vector, and each caption is also
represented as numbers. It is not possible to hold this amount of data
for 6000 images in memory, so we will be using a generator that
yields batches, as illustrated by the table below.
X1 (FEATURE)   X2 (TEXT SEQUENCE)                  Y (WORD TO PREDICT)
FEATURE        START                               TWO
FEATURE        START, TWO                          DOGS
FEATURE        START, TWO, DOGS                    DRINK
FEATURE        START, TWO, DOGS, DRINK             WATER
FEATURE        START, TWO, DOGS, DRINK, WATER      END
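A sketch of such a generator under these assumptions (each entry stored
by extract_features has shape (1, 2048), so features[key][0] is the
2048-dimensional vector):

def create_sequences(tokenizer, max_length, caption_list, feature, vocab_size):
    # Expand one image's captions into (feature, partial caption) -> next-word samples
    X1, X2, y = [], [], []
    for caption in caption_list:
        seq = tokenizer.texts_to_sequences([caption])[0]
        for i in range(1, len(seq)):
            in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
            out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

def data_generator(descriptions, features, tokenizer, max_length, vocab_size):
    # Yield one image's worth of training samples at a time to keep memory usage low
    while True:
        for key, caption_list in descriptions.items():
            in_img, in_seq, out_word = create_sequences(
                tokenizer, max_length, caption_list, features[key][0], vocab_size)
            yield (in_img, in_seq), out_word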
CHAPTER 6
PERFORMANCE EVALUATION
6.1 TESTING THE MODEL
Now that the model has been trained, we will make a separate file,
testing_caption_generator.py, which will load the model and
generate predictions. The predictions are sequences of up to max_length
index values, so we will use the same tokenizer.p pickle file to get
the words back from their index values. We encode all the test images
and save them in the file “encoded_test_images.pkl”.
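A sketch of greedy caption generation for a single encoded test image
(photo_feature is assumed to have shape (1, 2048)):

def word_for_id(integer, tokenizer):
    # Map a predicted index back to its word
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_caption(model, tokenizer, photo_feature, max_length):
    # Greedy decoding: start from 'start' and keep appending the most likely next word
    in_text = 'start'
    for _ in range(max_length):
        seq = pad_sequences(tokenizer.texts_to_sequences([in_text]), maxlen=max_length)
        yhat = model.predict([photo_feature, seq], verbose=0)
        word = word_for_id(int(np.argmax(yhat)), tokenizer)
        if word is None or word == 'end':
            break
        in_text += ' ' + word
    return in_text.replace('start ', '', 1)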
6.2 RESULT
The project described above was implemented, and descriptions for the
test images were generated successfully.
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT
7.1 CONCLUSION
We have implemented a CNN-RNN model by building an image
caption generator. A key point to note is that our model
depends on the data, so it cannot predict words that are out of its
vocabulary. The model is capable of autonomously viewing an image
and generating a description in natural language with
reasonable accuracy and naturalness. Work is still ongoing
on alternative pre-trained image models to improve the feature
extraction of the model. We also plan to improve and achieve
better performance by using word vectors trained on a much larger
corpus of data.
7.2 FUTURE ENHANCEMENT
Future work is to develop segmentation-based visual explanation methods
and to compare them with state-of-the-art approaches such as Grad-CAM.
There is also ongoing work on re-ranking the generated captions.
APPENDIX 1
SOURCE CODE
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image

# Load the pre-trained ViT encoder + GPT-2 decoder captioning model
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Run on GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

def predict_step(image_paths):
    # Load each image and make sure it is in RGB mode
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    # Preprocess the images and move the pixel tensors to the selected device
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Generate caption token ids with beam search and decode them to text
    output_ids = model.generate(pixel_values, **gen_kwargs)
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds

predict_step(['sample2.jpg'])  # ['a woman in a hospital bed with a woman in a hospital bed']
APPENDIX 2
SCREENSHOT
REFERENCES
1. Anderson P, Fernando B, Johnson M, Gould S (2016) Spice:
semantic propositional image caption evaluation
2. Wu Y, et al. (2016) Google’s neural machine translation system:
Bridging the gap between human and machine translation.
3. Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: A
neural image caption generator.
4. Schmidhuber J (2015) Deep learning in neural networks: An
overview. Neural Networks.
5. Hodosh M, Young P, Hockenmaier J (2013) Framing image
description.
6. Biswas R (2019) Diverse image caption generation and
automated human judgement through active learning.
7. Dudley JJ, Kristensson PO (2018) A review of user interface
design for interactive machine learning.
8. Gunning D, Aha D (2019) DARPA’s explainable artificial
intelligence (XAI) program.