
IMAGE CAPTIONING

MINI PROJECT REPORT

Submitted by

MAGESHKUMAR S (212221060155)
MOKESH M (212221060176)
SANJAY K (212221060240)

in partial fulfillment for the award of the

degree of

BACHELOR OF ENGINEERING

IN

ELECTRONICS AND COMMUNICATION ENGINEERING

SAVEETHA ENGINEERING COLLEGE


(AUTONOMOUS), THANDALAM
AFFILIATED TO ANNA UNIVERSITY, CHENNAI – 600 025

JUNE 2023

SAVEETHA ENGINEERING COLLEGE (AUTONOMOUS),
CHENNAI
ANNA UNIVERSITY, CHENNAI – 600 025

BONAFIDE CERTIFICATE

Certified that this project report “IMAGE CAPTIONING” is the bonafide
work of “MAGESH KUMAR S (212221060155), MOKESH M
(212221060176), SANJAY K (212221060240)”, who carried out the project
work under my/our supervision.

SIGNATURE
Dr. C. Sheeba Joice, M.E., M.B.A., Ph.D.
HEAD OF THE DEPARTMENT
ELECTRONICS AND COMMUNICATION ENGINEERING

SIGNATURE
Dr. N. Sugitha, M.E., Ph.D.
SUPERVISOR
Associate Professor
ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted for the project viva-voce examination held on _____________________

INTERNAL EXAMINER EXTERNAL EXAMINER

MINI PROJECT APPROVAL SHEET

The mini project report “IMAGE CAPTIONING” submitted by
‘MOKESH M (212221060176), SANJAY K (212221060240),
MAGESH KUMAR S (212221060155)’ is approved for submission,
as a partial requirement for the award of the Degree of Bachelor of
Engineering in Electronics and Communication Engineering, Anna
University, during the academic year 2022–2023.

Submitted for the University Project viva-voce examination held on ____________________.

INTERNAL EXAMINER EXTERNAL EXAMINER

ABSTRACT

The aim of image captioning is to automatically generate a description
of an image in natural language. With the development of deep
learning, it has become possible to use it to understand image content
and generate descriptive text. The objective of our project is to learn
the concepts of CNN and LSTM models and to build a working image
caption generator by combining a CNN with an LSTM. A convolutional
neural network (CNN) is employed as the encoding layer to extract
image features, a long short-term memory (LSTM) network with
attention is used as the decoder of the multilayer dense attention
model, and the description text is generated. The image is converted
into a multi-feature dataset characterizing its distinctive features.
The analysis is carried out on the popular Flickr8k dataset. The
experimental results on general images validate the model’s ability to
understand images and generate text. Finally, this report highlights
some open challenges in the image captioning task.

ACKNOWLEDGMENT

We convey our sincere thanks to Dr. N. M. Veeraiyan, President (SMET)
and Chancellor, SIMATS, Saveetha Amaravathi University, Dr. S. Rajesh,
Director, Saveetha Engineering College, and Dr. V. Saveetha Rajesh,
Director, Saveetha Medical College and Hospital, for providing us with the
facilities for the completion of our project. We are grateful to our Principal,
Dr. N. Duraipandian, M.E., Ph.D., for his continuous support and
encouragement in carrying out our project work. We are deeply indebted to
our beloved Head of the Department, Dr. Sheeba Joice, M.E., M.B.A., Ph.D.,
Department of Electronics and Communication, for giving us the
opportunity to display our professional skills through this project.

We are greatly thankful to our Project Coordinator, Dr. S. Asha, M.Tech.,
Ph.D., and our Project Guide, Dr. M. Vanitha, M.E., Ph.D., for their
valuable guidance and motivation, which helped us complete our project on
time.

We thank all the teaching and non-teaching faculty members of the
Department of Electronics and Communication for their passionate
support, for helping us identify our mistakes, and for the
appreciation they gave us. We heartily thank our library staff and the
management for their extensive support in providing the resources and
information that helped us complete the project successfully. We would
also like to record our deepest gratitude to our parents for their constant
encouragement and support, which motivated us greatly to complete our
project work.
TABLE OF CONTENTS

CHAPTER NO.        TITLE        PAGE NO.

ABSTRACT iv

LIST OF FIGURES viii

LIST OF ABBREVIATIONS x

1. INTRODUCTION 1

1.1 OVERVIEW 1
1.2 LSTM 1
1.3 CNN MODEL 2
1.3.1 ARCHITECTURE OF CNN 3
1.4 RNN MODEL 3
1.5 ARCHITECTURE OF LSTM 4
1.6 BACKEND 4

2. LITERATURE SURVEY 5

3. REQUIREMENT ANALYSIS 7

3.1 HARDWARE REQUIREMENT 7


3.1.1 GPU 7
3.1.2 TPU 8
3.2 SOFTWARE REQUIREMENT 8
3.2.1 TENSORFLOW 8
3.2.2 KERAS 8

4. SYSTEM ANALYSIS AND DESIGN 9

4.1 EXISTING SYSTEM 9


4.2 DRAWBACKS OF CAPTIONBOT 9
4.3 PROPOSED SYSTEM 9
4.4 UML DIAGRAMS 10
4.4.1 USE CASE DIAGRAM 10
4.4.2 ACTIVITY DIAGRAM 11
4.4.3 ARCHITECTURE FOR IMAGE CAPTIONING 12
5 SYSTEM IMPLEMENTATION 13
5.1 IMPORTING THE PACKAGES 13
5.2 GETTING AND PERFORMING DATA CLEANING 13
5.2.1 LOAD DOC 13
5.2.2 ALL IMG CAPTION 13
5.2.3 CLEANING TEXT 14
5.2.4 TEXT VOCABULARY 14
5.2.5 SAVE DESCRIPTION 14
5.3 EXTRACTING FEATURE VECTOR FROM ALL IMAGES 15
5.4 LOADING DATASET FOR TRAINING THE MODEL 15
5.5 TOKENIZING THE VOCABULARY 15
5.6 CREATE DATA GENERATOR 16

6 PERFORMANCE EVALUATION 17
6.1 TESTING THE MODEL 17
6.2 RESULT 17

7 CONCLUSION AND FUTURE ENHANCEMENT 18
7.1 CONCLUSION 18
7.2 FUTURE ENHANCEMENT 18

APPENDIX PAGE NO

APPENDIX 1 SOURCE CODE 19

APPENDIX 2 SCREENSHOT 20

LIST OF FIGURES

FIGURE NO. NAME PAGE NO.

1.3.1 ARCHITECTURE OF CNN 3


1.5 ARCHITECTURE OF LSTM 4

4.4.1 USE CASE DIAGRAM 10

4.4.2 ACTIVITY DIAGRAM 11

4.4.3 ARCHITECTURE DIAGRAM FOR IMAGE CAPTIONING 12

6.2 RESULT 17

LIST OF ABBREVIATIONS

CNN – Convolutional Neural Network

RNN – Recurrent Neural Network

LSTM – Long Short-Term Memory

MLP – Multilayer Perceptron

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW
Image captioning, automatically generating a natural-language
description of the content observed in an image, is an important part
of scene understanding, which combines the knowledge of computer
vision and natural language processing.

1.2 LSTM

LSTM stands for Long Short-Term Memory. LSTMs are a type of RNN
well suited to sequence prediction problems: based on the previous
text, we can predict what the next word will be. An LSTM can carry
relevant information throughout the processing of the inputs and,
with a forget gate, it discards non-relevant information. We use an
LSTM with soft attention as the decoder, which selectively focuses
attention on a certain part of the image to predict the next word.
The LSTM describes the input image by processing the already
extracted feature maps, acting as the language model of the decoder
that turns the feature vector into a sentence.

ResNet50 provides a reliable initialization for object recognition and
allows the training time to be reduced. For any image from the
training set, we take the output vector representation from the last
convolutional layer. This vector is fed to the LSTM input. Since the
length of a description may differ, the model must know where to
start and stop. To do this, we add two tokens, <START> and <END>,
which mark the beginning and end of each caption. The network for
generating the captions has to capture the words between these
tokens. The LSTM model learns to predict the next word St of the
caption based on the vector of visual features and the previous
t-1 words. The LSTM provides the next state vector ht and the next
word. The context vector zt is a concatenation of the feature vector
and the one-hot vector of the word representation.
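
The following is a minimal Keras sketch of one common way to wire such an encoder-decoder captioner, without the attention mechanism for brevity. It assumes a 2048-length image feature vector (as produced by the ResNet50/Xception encoders mentioned in this report); the 256-unit layer sizes and the name define_model are illustrative assumptions rather than the exact model used in this project.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Image encoder branch: compress the 2048-length feature vector
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Sequence branch: embed the partial caption and run it through an LSTM
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge both representations and predict the next word
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model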

1.3 CNN MODEL

Convolutional neural networks are specialized deep neural networks that can
process data with a 2D-matrix-like input shape. Images are easily
represented as a 2D matrix, so CNNs are very useful for working with images.
A CNN is basically used for image classification, for example identifying whether
an image shows a bird, a plane or Superman. Convolutional Neural Network (CNN)
layers are used for feature extraction on the input data.

The CNN model uses a sliding convolutional filter (kernel). This filter
detects features in the given data and generates a feature map. An
advantage of the CNN model is that it is location invariant, i.e. it can
detect a feature even if it appears somewhere else on the feature map.
It scans images from left to right and top to bottom to pull out important
features from the image and combines these features to classify the image.
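
As a toy illustration of how such a filter scans an image and produces a feature map, the sketch below slides a 3x3 kernel over a small array; the kernel values and image size are arbitrary assumptions chosen only for demonstration.

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image and compute one feature-map value per position
    kh, kw = kernel.shape
    h, w = image.shape
    feature_map = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(feature_map.shape[0]):
        for j in range(feature_map.shape[1]):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(6, 6)                 # toy grayscale image
edge_kernel = np.array([[1, 0, -1]] * 3)     # simple vertical-edge detector
print(convolve2d(image, edge_kernel).shape)  # (4, 4) feature map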

1.3.1 ARCHITECTURE OF CNN

1.4 RNN MODEL

RNNs are used in deep learning and in the development of models
that simulate the activity of neurons in the human brain. They are
especially powerful in use cases in which context is critical to
predicting an outcome. RNNs use feedback loops to process a
sequence of data that informs the final output, which can itself be a
sequence of data. These feedback loops allow information to
persist; the effect is often described as memory.
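
A minimal sketch of this feedback loop, assuming a plain (vanilla) RNN cell; the dimensions and random weights below are illustrative only.

import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, b):
    # The new hidden state depends on the current input and on the
    # previous hidden state, which is how the "memory" persists.
    return np.tanh(Wxh @ x_t + Whh @ h_prev + b)

hidden_size, input_size = 8, 4
Wxh = np.random.randn(hidden_size, input_size)
Whh = np.random.randn(hidden_size, hidden_size)
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in np.random.randn(5, input_size):   # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, Wxh, Whh, b)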

1.5 ARCHITECTURE OF LSTM

1.6 BACKEND

For the image caption generator, we will be using the Flickr8k image
dataset and the Flickr8k text dataset. Flickr8k contains 8,091 images;
the text dataset contains text files with the captions of the images.

CHAPTER 2
LITERATURE SURVEY

2.1 LITERATURE SURVEY

Image captioning requires recognizing the important objects, their
attributes, and their relationships in an image. It also needs to
generate syntactically and semantically correct sentences. Deep-
learning-based techniques are capable of handling the complexity
and challenges of image captioning. In this survey, we aim to present
a comprehensive review of existing deep-learning-based image
captioning techniques. Describing the content of an image is a
fundamental problem in artificial intelligence that connects
computer vision and natural language processing.

Convolutional Neural Networks (CNNs) are biologically inspired
variants of multilayer perceptrons. A CNN is composed of one or
more convolution layers (often with a subsampling step) followed
by one or more fully connected layers, as in a standard multilayer
neural network. The architecture of a CNN is designed to take
advantage of the 2D structure of an input image (or other 2D input
such as a speech signal). This is achieved with local connections and
tied weights followed by some form of pooling, which results in
translation-invariant features. Another benefit of CNNs is that they
are easier to train and have many fewer parameters than fully
connected networks with the same number of hidden units. CNNs
have been widely used and studied for image tasks, and are currently
the state of the art for object recognition and detection.

Recurrent Neural Networks (RNNs) are models that have shown
great promise in many NLP tasks. The idea behind RNNs is to make
use of sequential information. In a traditional neural network we
assume that all inputs (and outputs) are independent of each other,
but for many tasks that assumption does not hold. If you want to
predict the next word in a sentence, you have to know which words
came before it. RNNs are called recurrent because they perform the
same task for every element of a sequence, with the output depending
on the previous computations. Alternatively, RNNs can be thought of
as networks that have a “memory” which captures information about
what has been calculated so far.

Our approach is to infer these alignments and use them to learn a generative
model of descriptions. We develop a deep neural network model that infers the
alignment between segments of sentences and the regions of the image that they
describe. We introduce a Recurrent Neural Network architecture that takes
an input image and generates its description in text. Our experiments show that
the generated sentences produce sensible qualitative predictions.

CHAPTER 3

REQUIREMENT ANALYSIS

3.1 HARDWARE REQUIREMENT

The science and methodology behind deep learning have been in
existence for decades. In recent years, however, there has been a
significant acceleration in the use of deep learning, driven by an
increasing abundance of digital data and the availability of
powerful hardware.

3.1.1 GPU

A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to
rapidly manipulate and alter memory to accelerate the creation of images in a
frame buffer intended for output to a display device. GPUs are used in embedded
systems, mobile phones, personal computers, workstations and game consoles.
Compared to a CPU, the performance of matrix multiplication on a GPU is
significantly better. With GPU computing resources, all the deep learning tools
mentioned achieve much higher speedups compared to their CPU-only versions.
GPUs have become the platform of choice for training large, complex
neural-network-based systems because of their ability to accelerate such systems.
There is no single software tool that consistently outperforms the others,
however, which implies that there are still opportunities to further optimize
performance.

3.1.2 TPU

A Tensor Processing Unit (TPU) is an AI accelerator application-specific
integrated circuit developed by Google specifically for neural network machine
learning. The goal is to run whole inference models on the TPU to reduce I/O
between the TPU and the host CPU.

3.2 SOFTWARE REQUIREMENT

3.2.1 TENSORFLOW

TensorFlow is a free and open-source software library for machine
learning. It is a symbolic math library based on dataflow and
differentiable programming. TensorFlow is designed for remarkable
flexibility, portability, and efficient use of the available hardware.

3.2.2 KERAS

Keras is an open-source software library that provides a Python interface for
artificial neural networks. It acts as an interface for the TensorFlow library.
Keras allows for easy and fast prototyping and runs seamlessly on CPU and GPU.
It was developed with a focus on enabling fast experimentation. The
ImageDataGenerator class provided by the Keras API is essentially an
implementation of a generator function in Python.

CHAPTER 4

SYSTEM ANALYSIS AND DESIGN

4.1 EXISTING SYSTEM

Microsoft has published an image captioning experiment on the
internet called “CaptionBot”. The idea is that you upload a photo to the
service and it tries to automatically generate a caption. It accurately
detects what is shown in the image. The computer vision API
identifies the components of the photo, mixes that with data from
the Bing image search API, and runs any faces it spots through their
emotion API, which analyses human facial expressions to detect
anger, contempt, fear, happiness, sadness or surprise.

4.2 DRAWBACKS OF CAPTIONBOT

Sometimes the output is not accurate, and unwanted features are
detected. CaptionBot covers only a small percentage of visual
concepts, and it sometimes fails to describe the image at all.

4.3 PROPOSED SYSTEM

Image caption generation is a task that involves computer vision and
natural language processing concepts to recognize the context of an
image and describe it in a natural language such as English. We apply
the concepts of CNN and LSTM models and build a working image
caption generator by combining a CNN with an LSTM.

4.4 UML DIAGRAMS

4.4.1 USE CASE DIAGRAM

4.4.2 ACTIVITY DIAGRAM

4.4.3 ARCHITECTURE DIAGRAM FOR IMAGE CAPTIONING

CHAPTER 5

SYSTEM IMPLEMENTATION

5.1 IMPORTING THE PACKAGES

Here we need to import all the necessary packages. These include
numpy, string, the PIL Image module, pickle's dump, the Xception
model, os, and Keras layers such as Embedding and LSTM.
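
A hedged sketch of the kind of import block this step refers to, assuming a TensorFlow/Keras pipeline with the Xception feature extractor; exact module paths may differ between Keras versions.

import os
import string
import numpy as np
from pickle import dump, load
from PIL import Image
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model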

5.2 GETTING AND PERFORMING DATA CLEANING

The main text file, which contains all the image captions, is
Flickr8k.token in our Flickr_8k_text folder. Each line of the file
holds an image name and a caption, and consecutive entries are
separated by a newline (“\n”). Each image has 5 captions, and a
number #(0 to 4) is assigned to each caption. Here we will define
5 functions: load_doc, all_img_captions, cleaning_text,
text_vocabulary and save_descriptions.

5.2.1 LOAD DOC

This function loads the document file and reads its contents into a string.
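
A minimal sketch of such a helper (the name load_doc follows the wording above):

def load_doc(filename):
    # Open the file, read its entire contents into one string, and close it
    with open(filename, 'r') as f:
        return f.read()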

5.2.2 ALL IMG CAPTION

This function creates a descriptions dictionary that maps each image
to a list of its 5 captions.
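
A possible implementation, reusing load_doc from the previous sketch and assuming the usual tab-separated Flickr8k.token layout (image name with a #n suffix, then the caption):

def all_img_captions(filename):
    # Each line looks like "<image>.jpg#<n>\t<caption>"
    text = load_doc(filename)
    descriptions = {}
    for line in text.split('\n'):
        tokens = line.split('\t')
        if len(tokens) < 2:
            continue
        image_id, caption = tokens[0].split('#')[0], tokens[1]
        descriptions.setdefault(image_id, []).append(caption)
    return descriptions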

5.2.3 CLEANING TEXT

This function takes all the descriptions and performs data cleaning. This
is an important step when we work with textual data; depending on
our goal, we decide what type of cleaning we want to perform on
the text. In our case, we will remove punctuation, convert all text
to lowercase and remove words that contain numbers.
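
A sketch of this cleaning step; dropping non-alphabetic tokens is how words containing numbers are removed here:

import string

def cleaning_text(descriptions):
    table = str.maketrans('', '', string.punctuation)
    for image_id, captions in descriptions.items():
        for i, caption in enumerate(captions):
            # strip punctuation, lowercase, and keep only purely alphabetic words
            words = caption.translate(table).lower().split()
            captions[i] = ' '.join(w for w in words if w.isalpha())
    return descriptions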

5.2.4 TEXT VOCABULARY

This is a simple function that will separate all the unique words and
create the vocabulary from all the descriptions.

5.2.5 SAVE DESCRIPTION

This function creates a list of all the descriptions that have been
pre-processed and stores them in a file. We will create a
descriptions.txt file to store all the captions.
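
A combined sketch of the vocabulary and saving helpers described in 5.2.4 and 5.2.5; the tab separator between image name and caption is an assumption:

def text_vocabulary(descriptions):
    # Collect every unique word that appears in any cleaned caption
    vocab = set()
    for captions in descriptions.values():
        for caption in captions:
            vocab.update(caption.split())
    return vocab

def save_descriptions(descriptions, filename):
    # One "<image>\t<caption>" pair per line, written to descriptions.txt
    lines = []
    for image_id, captions in descriptions.items():
        for caption in captions:
            lines.append(image_id + '\t' + caption)
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))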

5.3 EXTRACTING THE FEATURE VECTOR FROM ALL IMAGES
This technique is also called transfer learning: we do not have to do
everything on our own; instead we use a pre-trained model that has
already been trained on a large dataset, extract features from that
model, and use them for our task. We are using the Xception model,
which has been trained on the ImageNet dataset with 1000 different
classes to classify. We can import this model directly from
keras.applications. Make sure you are connected to the internet, as
the weights are downloaded automatically. Since the Xception model
was originally built for ImageNet, we make small changes to integrate
it with our model. One thing to notice is that the Xception model takes
a 299x299x3 image as input. We remove the last classification layer
and obtain the 2048-length feature vector:
model = Xception(include_top=False, pooling='avg'). The function
extract_features() will extract features for all images, and we will map
image names to their respective feature arrays. Then we dump the
features dictionary into a “features.p” pickle file.
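
A sketch of what extract_features() might look like under these assumptions (the image directory name 'Flicker8k_Dataset' is illustrative):

import os
import numpy as np
from pickle import dump
from PIL import Image
from tensorflow.keras.applications.xception import Xception, preprocess_input

def extract_features(directory):
    # Xception without its classification head returns a 2048-length vector per image
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for name in os.listdir(directory):
        image = Image.open(os.path.join(directory, name)).convert('RGB').resize((299, 299))
        image = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)
        image = preprocess_input(image)   # scale pixels to the range Xception expects
        features[name] = model.predict(image, verbose=0)
    return features

features = extract_features('Flicker8k_Dataset')
dump(features, open('features.p', 'wb'))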

5.4 LOADING DATASET FOR TRAINING THE MODEL

In the Flickr_8k_test folder, we have the Flickr8k.trainImages.txt file,
which contains a list of 6000 image names that we will use for training.
For loading the training dataset, we need a few more functions
(sketched after the list):

• Load photos – loads the text file into a string and returns the list
of image names.

• Load clean descriptions – creates a dictionary that contains the
captions for each photo in the list of photos. We also append the
<start> and <end> identifiers to each caption, so that our LSTM
model can identify the start and end of a caption.

• Load features – returns the dictionary of image names and their
feature vectors, which we previously extracted with the Xception
model.
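
A sketch of these three helpers, reusing load_doc from section 5.2; the <start>/<end> markers follow the description above:

from pickle import load

def load_photos(filename):
    # The split file holds one image name per line
    return [line for line in load_doc(filename).split('\n') if line]

def load_clean_descriptions(filename, photos):
    descriptions = {}
    for line in load_doc(filename).split('\n'):
        words = line.split()
        if len(words) < 2:
            continue
        image_id, caption = words[0], words[1:]
        if image_id in photos:
            # wrap each caption with the markers the LSTM uses to start and stop
            descriptions.setdefault(image_id, []).append(
                '<start> ' + ' '.join(caption) + ' <end>')
    return descriptions

def load_features(photos):
    all_features = load(open('features.p', 'rb'))
    return {k: all_features[k] for k in photos}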
5.5 TOKENIZING THE VOCABULARY

Computers do not understand English words, so we have to represent
words with numbers. We therefore map each word of the vocabulary
to a unique index value. The Keras library provides a Tokenizer class
that we will use to create tokens from our vocabulary and save them
to a “tokenizer.p” pickle file.
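
A sketch of this step; train_descriptions is a placeholder name for the cleaned training captions loaded in section 5.4:

from pickle import dump
from tensorflow.keras.preprocessing.text import Tokenizer

def create_tokenizer(descriptions):
    # Flatten all captions into one list and let Keras assign an index to each word
    lines = [caption for captions in descriptions.values() for caption in captions]
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

tokenizer = create_tokenizer(train_descriptions)   # hypothetical dict from section 5.4
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1         # +1 because index 0 is reserved for padding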

5.6 CREATE DATA GENERATOR

Here we need to train our model on 6000 images, where each image is
represented by a 2048-length feature vector and each caption is also
represented as a sequence of numbers. It is not possible to hold this
amount of data for 6000 images in memory, so we will use a generator
method that yields batches, as illustrated by the table and the sketch
below.

X1 (FEATURE)    X2 (TEXT SEQUENCE)                  Y (WORD TO PREDICT)
feature         start                               two
feature         start, two                          dogs
feature         start, two, dogs                    drink
feature         start, two, dogs, drink             water
feature         start, two, dogs, drink, water      end
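
A sketch of such a generator, matching the (image feature, partial caption) to next-word pattern shown in the table above; the returned tuple layout follows what Keras model.fit() accepts for a two-input model.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, features, tokenizer, max_length, vocab_size):
    # Yields one batch per image so the whole dataset never has to sit in memory
    while True:
        for image_id, captions in descriptions.items():
            feature = features[image_id][0]
            X1, X2, y = [], [], []
            for caption in captions:
                seq = tokenizer.texts_to_sequences([caption])[0]
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(feature)   # image feature vector
                    X2.append(in_seq)    # partial caption so far
                    y.append(out_word)   # next word to predict (one-hot)
            yield (np.array(X1), np.array(X2)), np.array(y)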

CHAPTER 6

PERFORMANCE EVALUATION

6.1 TESTING THE MODEL

Once the model has been trained, we create a separate file,
testing_caption_generator.py, which loads the model and generates
predictions. The predictions are sequences of index values (up to the
maximum caption length), so we use the same tokenizer.p pickle file
to map the index values back to words. We encode all the test images
and save them in the file “encoded_test_images.pkl”.
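
A sketch of the greedy decoding loop such a testing script could use; note that the Keras Tokenizer's default filters strip the angle brackets, so the markers from section 5.4 are indexed simply as 'start' and 'end'.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    # Map word indices back to words for readable output
    index_word = {index: word for word, index in tokenizer.word_index.items()}
    photo = photo_feature.reshape((1, 2048))
    words = ['start']
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([' '.join(words)])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probabilities = model.predict([photo, seq], verbose=0)
        word = index_word.get(int(np.argmax(probabilities)))
        if word is None or word == 'end':
            break
        words.append(word)
    return ' '.join(words[1:])   # drop the leading 'start' marker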

6.2 RESULT

The project was completed, and descriptions for the input images were
generated successfully.

CHAPTER 7

CONCLUSION AND FUTURE ENHANCEMENT

7.1 CONCLUSION

We have implemented a CNN-RNN model by building an image
caption generator. A key point to note is that our model depends on
its training data, so it cannot predict words that are outside its
vocabulary. The model is capable of autonomously viewing an image
and generating a reasonable description in natural language with
reasonable accuracy and naturalness. Work is still ongoing on
alternative pre-trained image models to improve the feature
extraction of the model. We also plan to improve performance by
using word vectors trained on a much larger corpus of data.

7.2 FUTURE ENHANCEMENT

Future work is to develop segmentation-based visual explanation methods
and to compare them with state-of-the-art approaches such as Grad-CAM.
There is also ongoing work on re-ranking.

APPENDIX 1

SOURCE CODE

from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
import torch
from PIL import Image

# Load the pre-trained ViT encoder / GPT-2 decoder captioning model
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    # Open every image and make sure it is in RGB mode
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    # Convert the images to pixel tensors and generate caption token ids
    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Decode the token ids back into caption strings
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [pred.strip() for pred in preds]


predict_step(['sample2.jpg'])  # e.g. ['a woman in a hospital bed with a woman in a hospital bed']

APPENDIX 2
SCREENSHOT

