
Automated Image Captioning using CNN and RNN

J COMPONENT PROJECT REPORT

Submitted by
Niyati Sahni (17BCE0506)
Prishita Ray (17BCE2405)
Nallamilli Yogeswarasriramareddy (17BCE0524)

in partial fulfillment for the award of the degree of

B. Tech
in
Computer Science and Engineering

School of Computer Science and Engineering

Vellore-632014, Tamil Nadu, India


October, 2020
ABSTRACT: -

With the evolution of technology, image captioning has become an important capability for almost every industry that relies on information abstraction. Interpreting such information automatically is complex and time-consuming: for a machine to understand the context and environment of an image, it needs a high-level understanding of what the image depicts. Many deep learning methods have moved away from traditional hand-crafted approaches and are changing the way systems understand and interpret images, mainly by generating captions and learning a well-defined vocabulary linked to images.

Advances in hardware and the ease of computing over large datasets have made it possible to apply deep learning to many projects on a personal workstation. A captioning solution requires both that the content of the image be understood and translated into words, and that those words be strung together into a comprehensible sentence. It therefore combines computer vision using deep learning with natural language processing, and marks a truly challenging problem in broader artificial intelligence.

In this project, we create an automated image captioning model using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to produce text that best describes an image. The model is trained on the Flickr 8k dataset. Since image captioning requires a neural network, the steps to perform are clearly defined: to create a deep neural network using a CNN and an RNN, we first link each description to its image; the convolutional neural network takes the image and breaks it down into a set of features, and the recurrent neural network turns those features into well-formed descriptive language.

The project uses an encoder-decoder model: the CNN performs feature extraction, and the output of the encoder is fed into the decoder, which turns the extracted features into appropriate sentences. Feature extraction is done with the Inception V3 architecture through transfer learning, so that the pre-trained network can be adapted to our specific purpose. The language model uses the Natural Language Toolkit for simple natural language processing, and the recurrent network architecture used is Long Short-Term Memory (LSTM).
I.INTRODUCTION: -
Our project aims at creating an automatic captioning model for images. This is achieved by using a CNN for image feature extraction and an RNN for generating the caption. We have chosen Long Short-Term Memory (LSTM), which is a type of RNN. An LSTM can learn from sequential data such as a series of words or characters. These networks use hidden layers that connect the input and output layers and form a memory loop, so that the model can learn from its past outputs. In short, the CNN handles spatial features and the RNN handles sequential data.

This project combines advanced computer vision using deep learning with natural language processing using a recurrent neural network. Deep learning is a machine learning technique with which a computer can be programmed to learn complex models and understand patterns in large datasets. Driven by increasing computation speed, a wealth of research, and the rapid growth of technology, deep learning and AI are experiencing massive growth worldwide and may well become one of the world's biggest industries in the near future. The 21st century marks the birth of the AI revolution, with data becoming its new 'oil'. Every second, today's world generates large amounts of data, and we need models that can study these datasets, discover patterns, and find solutions for analysis and research. Deep learning makes this possible.

Computer vision is a cutting-edge field of computer science that aims to enable computers to understand what is being seen in an image. Computers do not perceive the world as humans do: to them, an image is just a grid of raw numbers. Limitations such as camera type, lighting conditions, clarity, scaling, and viewpoint variation make computer vision hard, because it is very difficult to build a robust model that works under every condition.

The neural network architectures we normally see are trained on the current inputs only: the generated output does not consider previous inputs, because these networks have no memory elements. Using an RNN tackles this lack of memory and allows us to build a more effective system.

Motivation of the project: -


Based on how humans are able to perceive and analyze objects around them, teaching
machines to do the same is a fairly challenging task, but becomes necessary in many AI
applications, such as medical imaging, anomaly detection, and other identification tasks. For
this purpose, we have decided to create a system that can simplify this task and work as
accurately in identifying and captioning images as humans can. The advantage of designing
such a system is that machines can caption a huge dataset of images and augment this
capability of human beings when faced with such a great amount of data. Analysis can further
help in drawing conclusions or predicting future outcomes.

II.Literature Survey
The surveyed papers are summarized below.

Paper: Image Captioning with Object Detection and Localization
Algorithm used: Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks as RNNs
Dataset used: COCO dataset
Performance measured: BLEU-1 score of 74%
Existing system: Previous works focus on the decoding step and usually use recurrent neural networks (RNNs) based on long short-term memory (LSTM) units as the decoder. As far as the encoding step is concerned, the work is divided into two major classes: CNN-RNN models and attention-based models. Attention-based models allow the network to attend to any visual part of the input image, but most of the sub-regions it attends to are meaningless.
Proposed system: Our model consists of two sub-models: an object detection and localization model, which extracts the objects and their spatial relationships in images, and a deep recurrent neural network (RNN) based on long short-term memory (LSTM) units with an attention mechanism for sentence generation.

Paper: Image Captioning using Deep Learning
Algorithm used: Convolutional Neural Networks based on the ResNet and VGG-18 architectures along with Recurrent Neural Networks (RNNs)
Dataset used: MSCOCO dataset
Performance measured: BLEU score, 40% accuracy
Existing system: The neural network architecture consists of 650,000 neurons with 60 million parameters and contains five convolutional layers. Some of these layers are followed by max-pooling layers, and three fully connected layers end in a final 1000-way SoftMax layer.
Proposed system: Our model for captioning images is built on multimodal recurrent and convolutional neural networks. A Convolutional Neural Network is used to extract the features from an image, which, along with the captions, are then fed into a Recurrent Neural Network.

Paper: Improving Image Captioning by Leveraging Knowledge Graphs
Algorithm used: Convolutional Neural Network (CNN), RNN, and LSTM
Dataset used: MS-COCO dataset
Performance measured: BLEU score, 40% accuracy
Existing system: The use of knowledge graphs that capture general or common-sense knowledge, to augment the information extracted from images by state-of-the-art methods for image captioning, is explored. The merit of this methodology is that using background knowledge as a graph results in better captions and increased accuracy.
Proposed system: Our model for captioning images is built on multimodal recurrent and convolutional neural networks. A Convolutional Neural Network is used to extract the features from an image, which, along with the captions, are then fed into a Recurrent Neural Network.

Paper: Convolutional Image Captioning
Algorithm used: Backpropagation Through Time (BPTT), LSTM, RNN
Dataset used: MS-COCO dataset
Performance measured: BLEU score, 46% accuracy
Existing system: A PixelCNN architecture for conditional image generation that approximates an RNN is proposed. The paper demonstrates that convolutional architectures with attention achieve state-of-the-art performance on machine translation tasks.
Proposed system: Our model for captioning images is built on multimodal recurrent and convolutional neural networks. A Convolutional Neural Network is used to extract the features from an image, which, along with the captions, are then fed into a Recurrent Neural Network.

III.Methodology
To create a deep neural network using a CNN and an RNN, we first link each description to its image. The convolutional neural network takes the image and breaks it down into a set of features, and the recurrent neural network turns those features into well-formed descriptive language.

Proposed model:

CNN Encoder
The encoder is based on a convolutional neural network that encodes an image into a compact representation. The CNN encoder used here is an Inception V3 network. As in residual networks, skip connections allow the activation of one layer to be fed directly into a layer deeper in the network, which is what makes such deep architectures, including the Inception V3 module, feasible to use for feature extraction.
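A minimal sketch of how such an encoder can be set up with Keras transfer learning is shown below; the exact layer choice and the helper function name are illustrative assumptions, not the project's verbatim code.

```python
# Sketch of the CNN encoder: a pre-trained InceptionV3 network with its
# classification head removed, used purely as a feature extractor
# (transfer learning). The layer indexing here is illustrative.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

# Load InceptionV3 pre-trained on ImageNet and drop the final softmax layer,
# keeping the 2048-dimensional pooled feature vector as the image encoding.
base = InceptionV3(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def encode_image(img_path):
    """Return a 2048-d feature vector for one image (hypothetical helper)."""
    img = image.load_img(img_path, target_size=(299, 299))  # InceptionV3 input size
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x, verbose=0).reshape(2048)
```

Because the ImageNet weights are frozen and only the pooled activations are reused, the encoder needs no further training for this project.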

RNN Decoder
The CNN encoder is followed by a recurrent neural network that generates the corresponding sentence. The RNN decoder consists of a single LSTM layer followed by one fully-connected (linear) layer. The RNN is trained on the Flickr 8k dataset and is used to predict the next word of a sentence based on the previous words. The captions are presented as lists of tokenized words so that the RNN model can train and backpropagate to reduce errors and generate better, more understandable text describing the image.
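The following sketch shows one common way to wire such a decoder in Keras, merging the image feature vector with the LSTM output; the vocabulary size, embedding width, and maximum caption length are placeholder values, not measured from this project's data.

```python
# Sketch of the RNN decoder described above: image features and the partial
# caption are merged, a single LSTM layer processes the sequence, and one
# fully-connected layer predicts the next word. Dimensions are illustrative.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size = 8000   # assumed vocabulary size after tokenization
max_length = 34     # assumed maximum caption length in tokens

# Image branch: the 2048-d InceptionV3 feature vector projected to 256 units.
img_in = Input(shape=(2048,))
img_feat = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Caption branch: word indices -> embeddings -> single LSTM layer.
cap_in = Input(shape=(max_length,))
cap_emb = Embedding(vocab_size, 256, mask_zero=True)(cap_in)
cap_feat = LSTM(256)(Dropout(0.5)(cap_emb))

# Merge the two branches; one fully-connected layer scores the vocabulary.
out = Dense(vocab_size, activation="softmax")(add([img_feat, cap_feat]))

decoder = Model(inputs=[img_in, cap_in], outputs=out)
decoder.compile(loss="categorical_crossentropy", optimizer="adam")
```

Merging by element-wise addition keeps the image and caption branches at the same width; concatenation is an equally common design choice.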
Flickr 8k dataset
This dataset contains 8,000 images, and each image has 5 different descriptions that reflect its features and context. The images cover a wide range of scenes and environments, and the descriptions are well defined.
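As an illustration, the captions can be read into a dictionary mapping each image to its five descriptions. The sketch below assumes the standard Flickr8k.token.txt file with tab-separated "image#index" and caption columns; the file path is an assumption.

```python
# Sketch: read the Flickr8k caption file into a dict mapping
# image id -> list of its five reference captions.
def load_captions(token_file="Flickr8k.token.txt"):  # path is an assumption
    captions = {}
    with open(token_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            img_and_idx, caption = line.split("\t", 1)
            img_id = img_and_idx.split("#")[0]          # drop the '#0'..'#4' suffix
            captions.setdefault(img_id, []).append(caption)
    return captions

captions = load_captions()
print(len(captions))                    # roughly 8,000 images
print(captions[next(iter(captions))])   # the five captions of the first image
```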

Inception V3 module:
There are four versions of the Inception architecture. The first, GoogLeNet, is Inception-v1, but there are numerous inconsistencies in the Inception-v3 paper that have led to wrong descriptions of the Inception versions, possibly due to the intensity of the ILSVRC competition at the time. Consequently, many reviews on the internet mix up v2 and v3, and some even treat them as equivalent with only minor differences in settings.

ARCHITECTURE: -
BACKGROUND STUDY: -

[1] A Survey on Automatic Image Caption Generation

Authors: Shuang Bai, Shan An

APA Reference: Bai, S., & An, S. (2018). A survey on automatic image caption generation.
Neurocomputing, 311, 291-304.

Image captioning is the task of automatically producing a caption for an image. As a recently emerged research area, it is attracting more and more interest. To achieve the goal of image captioning, the semantic information of an image must be captured and expressed in natural language.
Connecting the research communities of computer vision and natural language processing, image captioning is a challenging task, and various approaches have been proposed to address it. This paper gives a survey of advances in image captioning research and classifies captioning approaches into categories based on the method followed. The architecture discussed is multimodal learning: first, a deep convolutional neural network is used for feature extraction; next, words are predicted by a language model conditioned on the image features. A second family of approaches is encoder-decoder based captioning, which is more complex and shows good results.
Representative strategies in each category are summarized, and their strengths and limitations are discussed. The advantage of the novel, generation-based methods is that they are not retrieval based: retrieval methods produce grammatical and fluent captions, but those captions already exist and do not adapt to new images, and in some instances they may be irrelevant.

[2] Image Captioning with Object Detection and Localization

Authors: Zhongliang Yang, Yu-Jin Zhang, Sadaqat ur Rehman, Yongfeng Huang

APA Reference: Yang, Z., Zhang, Y. J., ur Rehman, S., & Huang, Y. (2017, September).
Image captioning with object detection and localization. In International Conference on
Image and Graphics (pp. 109-118). Springer, Cham.

Automatically generating a natural language description of an image is a task close to the concept of image understanding. In this paper, a multi-model neural network method, closely related to the human visual system, that automatically learns to describe the content of images is presented. The model consists of two sub-models: an object detection and localization model, which extracts object information and the spatial relationships between objects in the image, and a deep recurrent neural network (RNN) based on long short-term memory (LSTM) units with an attention mechanism for sentence generation. The proposed model comprises two principal parts: encoding and decoding. The input to the model is an image I, while the output is a descriptive sentence S consisting of K encoded words. In the encoding part, the model first detects objects in the input image and then uses a deep CNN to extract their regions, which capture the related spatial relationships. All of this information is represented as a set of feature vectors referred to as annotation vectors. The encoding part produces L annotation vectors, each of which is a D-dimensional representation corresponding to an object and its spatial location in the input image. In the decoding part, all of these vectors are fed into a deep recurrent neural network to produce a descriptive caption. Each word of the description is automatically aligned to different objects of the input image during generation, much like the attention mechanism of the human visual system. Experimental results on the Flickr 8k dataset demonstrate the merit of the proposed method, which outperforms previous benchmark models.
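As a rough illustration of the soft attention idea described above (not the authors' exact implementation), the following NumPy sketch weights the L annotation vectors by their relevance to the current LSTM hidden state and sums them into a context vector; all parameter names are hypothetical.

```python
# Illustrative sketch of soft attention over annotation vectors:
# each of the L image regions gets a relevance score w.r.t. the decoder
# state, scores are normalized with softmax, and the context vector is
# the weighted sum of the annotation vectors.
import numpy as np

def soft_attention(annotations, hidden, W_a, W_h, v):
    """annotations: (L, D) annotation vectors; hidden: (H,) LSTM state.
    W_a: (D, K), W_h: (H, K), v: (K,) are learned projection parameters."""
    scores = np.tanh(annotations @ W_a + hidden @ W_h) @ v   # (L,) relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # softmax over regions
    context = weights @ annotations                          # (D,) weighted sum
    return context, weights
```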

[3] Improving Image Captioning by Leveraging Knowledge Graphs

Authors: Yimin Zhou, Yiwei Sun, Vasant Honavar

APA Reference: Zhou, Y., Sun, Y., & Honavar, V. (2019, January). Improving image
captioning by leveraging knowledge graphs. In 2019 IEEE Winter Conference on
Applications of Computer Vision (WACV) (pp. 283-293). IEEE.

This paper explores the use of knowledge graphs that capture general or common-sense knowledge to augment the information extracted from images by state-of-the-art image captioning methods. The merit of the methodology is that using background knowledge in the form of a graph results in better captions and increased accuracy. The authors devise an architecture named CNet-NIC that is trained to detect and recognize nine thousand object categories. An RNN is first used to obtain a vector-space embedding of the terms, a CNN is then used to extract image features, and finally an LSTM-RNN is trained to generate the captions. In short, the steps are data processing, feature extraction, object detection, leveraging background knowledge, model training, and finally testing and learning from the results.
The experimental results on several benchmarks and standard datasets such as MS-COCO, as measured by CIDEr-D, a standard performance metric for image captioning, show that variants of state-of-the-art captioning methods that make use of information extracted from knowledge graphs can substantially outperform those that rely solely on information extracted from the images.
[4] Convolutional Image Captioning

Authors: Jyoti Aneja, Aditya Deshpande, Alexander G. Schwing


APA Reference: Aneja, J., Deshpande, A., & Schwing, A. G. (2018). Convolutional image
captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 5561-5570).

In recent times significant progress has been made in image captioning using recurrent neural networks powered by long short-term memory (LSTM) units. Despite mitigating the vanishing gradient problem, and despite their compelling ability to memorize dependencies, LSTM units are complex and inherently sequential across time. The complex gating and overwriting mechanism, combined with this inherently sequential processing and the significant storage required for backpropagation through time (BPTT), makes training difficult. Even though RNNs based on LSTM/GRU units give excellent results, e.g. for image captioning, the training procedure is far from trivial: while the forward pass during training can be parallelized across samples, it is inherently sequential in time, which limits parallelism. To address this issue, a PixelCNN architecture for conditional image generation that approximates an RNN is proposed. The article also demonstrates that convolutional architectures with attention achieve state-of-the-art performance on machine translation tasks.

IV.Experimentations and Results

DATASET: -
Flickr 8k dataset - An open-access labeled dataset that contains a total of 8092 images in
.JPEG format with different dimensions. It also contains 5 captions for every image in the
dataset.
HARDWARE AND SOFTWARE REQUIREMENTS:-
Hardware Requirements
- Processor: Minimum 1 GHz; Recommended 2 GHz or more
- Ethernet connection (LAN) or a wireless adapter (Wi-Fi)
- Hard Drive: Minimum 32 GB; Recommended 64 GB or more
- Memory (RAM): Minimum 1 GB; Recommended 4 GB or more

Software Requirements
- Keras
- TensorFlow
- Jupyter Notebook
- Windows/Linux
Description of Full Implementation: -
We have used the Flickr 8k dataset. This dataset contains 8,000 images, and each image has 5 different descriptions based on its features and context; the descriptions are well defined and cover a variety of environments. The dataset is thus a collection of images together with their captions, and our goal is to predict an accurate caption for a given image. We have used 80% of the dataset for training the model, 10% for validation, and the remaining 10% for testing. Before any data can be used for prediction, classification, or analysis, the first and foremost step is pre-processing of the data.
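A minimal sketch of this pre-processing step, using the Keras tokenizer and an 80/10/10 split, is given below; the cleaning rules are assumptions consistent with the description above, and the captions dictionary is taken from the earlier loading sketch.

```python
# Sketch: clean captions, fit a tokenizer, and split image ids 80/10/10.
import random
import string
from tensorflow.keras.preprocessing.text import Tokenizer

def clean(caption):
    # lower-case, strip punctuation, drop single-character and numeric tokens
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"   # sequence boundary tokens

cleaned = {img: [clean(c) for c in caps] for img, caps in captions.items()}

all_captions = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in all_captions)

ids = list(cleaned.keys())
random.Random(42).shuffle(ids)
n = len(ids)
train_ids = ids[: int(0.8 * n)]
val_ids = ids[int(0.8 * n): int(0.9 * n)]
test_ids = ids[int(0.9 * n):]
```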
Logical analysis with minimum one related work: -
The task of image captioning can be divided into two modules logically - one is an image-based
model - which extracts the features and nuances out of our image, and the other is a language-
based model which translates the features and objects given by our image-based model to a
natural sentence. For our image-based model (viz encoder) we usually rely on a Convolutional
Neural Network model. And for our language-based model (viz decoder) we rely on a Recurrent
Neural Network. Usually, a pretrained CNN extracts the features from our input image. The
feature vector is linearly transformed to have the same dimension as the input dimension of the
RNN/LSTM network. This network is trained as a language model on our feature vector.
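The sketch below illustrates how each caption can be expanded into (image feature, partial caption) pairs with next-word targets for this language model; it assumes the tokenizer, max_length, and vocab_size from the earlier pre-processing sketch, and the helper name is ours.

```python
# Sketch: expand one caption into supervised next-word training samples.
# For "startseq a dog runs endseq", the pairs are
#   [startseq]               -> a
#   [startseq, a]            -> dog
#   [startseq, a, dog]       -> runs
#   [startseq, a, dog, runs] -> endseq
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def make_pairs(feature, caption, tokenizer, max_length, vocab_size):
    X_img, X_seq, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]   # partial caption
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]  # next word
        X_img.append(feature)
        X_seq.append(in_seq)
        y.append(out_word)
    return np.array(X_img), np.array(X_seq), np.array(y)
```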

RESULTS (SCREENSHOTS): -

For the first image, we obtained similar output with different beam indices, confirming that there is a single caption best suited to the image.
Arranged according to maximum probability, the most suited caption for the given image is:
"Three men in white uniforms playing basketball."

For the second image, we obtained the same caption with beam indices 5 and 7 but a different result with beam index 3.
Arranged according to maximum probability, the most suited caption for the given image is:
"A person on a bike is coming up through the mud."

For the third image, we obtained different captions with different beam indices.
Arranged according to maximum probability, the most suited caption for the given image is:
"A snowboarder flies through the air after mid-air from a mountain"

For the fourth image, we obtained the same caption with beam indices 3 and 5 but a different result with beam index 7.
Arranged according to maximum probability, the most suited caption for the given image is:
"A tan dog runs through the snow."

For the fifth image, we obtained the same caption with beam indices 5 and 7 but a different result with beam index 3.
Arranged according to maximum probability, the most suited caption for the given image is:
"A little girl in a red coat plays in snow."
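For context on how the beam index affects these results, the sketch below shows the general shape of a beam search decoder over the trained model; it assumes the decoder, tokenizer, max_length, and the "startseq"/"endseq" boundary tokens from the earlier sketches, and is an illustrative reconstruction rather than the project's exact code.

```python
# Sketch: beam search over the decoder. At every step the k most probable
# partial captions are kept; a larger beam index (k) explores more
# alternatives, which is why k = 3, 5, 7 can return different captions.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(decoder, feature, tokenizer, max_length, k=3):
    start = [tokenizer.word_index["startseq"]]
    beams = [(start, 0.0)]                        # (token sequence, log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if tokenizer.index_word.get(seq[-1]) == "endseq":
                candidates.append((seq, score))   # finished caption, keep as-is
                continue
            padded = pad_sequences([seq], maxlen=max_length)
            probs = decoder.predict([feature.reshape(1, -1), padded], verbose=0)[0]
            for w in np.argsort(probs)[-k:]:      # k best next words
                candidates.append((seq + [int(w)], score + np.log(probs[w] + 1e-12)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    words = [tokenizer.index_word[i] for i in beams[0][0]]
    return " ".join(w for w in words if w not in ("startseq", "endseq"))
```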

V.Conclusions
We have built an efficient system around which other unique applications and ecosystems can be developed. Image captioning creates an illustrative description of an image and makes the system smarter: an API call can supply an image, and our system uses the already-trained model to generate a caption automatically. The NLP-based system is thus able to work from the extracted features so that image captioning is accurate and context-driven.

For further precision, the number of RNN layers can be increased, which should improve the richness and accuracy of the generated descriptions.
VI.REFERENCES
[1] Anderson, P., Fernando, B., Johnson, M. and Gould, S., 2016, October. Spice: Semantic
propositional image caption evaluation. In European Conference on Computer Vision
(pp. 382-398). Springer, Cham.

[2] Aneja, J., Deshpande, A. and Schwing, A.G., 2018. Convolutional image captioning. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.
5561-5570).

[3] Bai, S. and An, S., 2018. A survey on automatic image caption generation.
Neurocomputing, 311, pp.291-304.

[4] Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W. and Chua, T.S., 2017. Sca-cnn:
Spatial and channel-wise attention in convolutional networks for image captioning.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
5659-5667).

[5] Gu, J., Wang, G., Cai, J. and Chen, T., 2017. An empirical study of language cnn for
image captioning. In Proceedings of the IEEE International Conference on Computer
Vision (pp. 1222-1231).

[6] Kinghorn, P., Zhang, L. and Shao, L., 2019. A hierarchical and regional deep learning
architecture for image description generation. Pattern Recognition Letters, 119, pp.77-85.

[7] Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z. and Yuille, A., 2014. Deep captioning
with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.

[8] Yang, Z., Zhang, Y.J., ur Rehman, S. and Huang, Y., 2017, September. Image captioning
with object detection and localization. In International Conference on Image and
Graphics (pp. 109-118). Springer, Cham.

[9] Vinyals, O., Toshev, A., Bengio, S. and Erhan, D., 2015. Show and tell: A neural image
caption generator. In Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 3156-3164).

[10] Zhou, Y., Sun, Y. and Honavar, V., 2019, January. Improving image captioning by
leveraging knowledge graphs. In 2019 IEEE Winter Conference on Applications of
Computer Vision (WACV) (pp. 283-293). IEEE.

[11] Zhou, L., Xu, C., Koch, P. and Corso, J.J., 2016. Image caption generation with
text-conditional semantic attention. arXiv preprint arXiv:1606.04621, 2.
