
2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)

Amity University, Noida, India. June 4-5, 2020

Generating Image Captions Based on Deep Learning and Natural Language Processing

Smriti Sehgal, Assistant Prof., Dept. of Computer Science and Engineering, ASET, Amity University, Noida (smriti1486@gmail.com)
Jyoti Sharma, Dept. of Computer Science and Engineering, ASET, Amity University, Noida (jyoti03sharma98@gmail.com)
Natasha Chaudhary, Dept. of Computer Science and Engineering, ASET, Amity University, Noida (natasha.chaudhary660@gmail.com)

Abstract—This model enables an individual to input an image and obtain a description of it as output. The paper makes use of the functionalities of Deep Learning and NLP (Natural Language Processing). Image caption generation is an important task, as it allows us to automate the writing of captions for any image. This functionality enables us to organize files easily without paying heed to the task of captioning, and it is also important for making dynamic web pages. The work is also aimed at people who are visually impaired or short-sighted: rather than looking at an image with difficulty, they can easily read the caption generated by this model in a larger format. With a later implementation for video, the model could also be used to describe a video in real time.

Keywords—deep learning; convolutional neural network; filters; recurrent neural network; natural language processing; LSTM

I. INTRODUCTION

Image caption generation is based on the functionalities of CNNs and RNNs. The Keras library has been used, development has been done in Jupyter Notebook, and the implementation has been written in Python. This paper makes use of the COCO dataset. COCO (Common Objects in Context) is a specialized dataset which contains about 1 lakh (100,000) images, each with five associated captions. This makes it an apt dataset for our model and allows a well-trained model to be developed after rigorous training.

The CNN is mainly used to extract objects and other spatial patterns from our input image, whereas the RNN works efficiently with any kind of sequential data fed to it. The CNN is often called the encoder, whereas the RNN is associated with the decoder.

Natural Language Processing is also being used here, alongside the CNN, to generate image captions. LSTMs are specialized recurrent neural networks which allow information to persist. Use has been made of the VGG16 model, pre-trained on the ImageNet dataset for the classification of images.

II. PLATFORM USED

In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter. It is an open-source web application used to develop and run code and to implement models in an effective manner; it enables us to incorporate live code, visualizations, and narrative text. The implementation of the model is being done in Jupyter Notebook.

III. WORKFLOW

Fig. 1. Basic workflow of the model [8]

The two deep learning algorithms used here are the convolutional neural network and the recurrent neural network.

Firstly, the input image is passed through the Convolutional Neural Network (CNN) to identify the objects and scenes present in the image; transfer learning is used here with the pre-trained model. In the CNN, various concepts such as pooling, padding, and filters are used. A set of words/objects is generated as the output of the CNN model. Next, Natural Language Processing (NLP) is used, which helps the system handle human language.

Finally, a Recurrent Neural Network is trained using the Flickr8k_text dataset. The detected objects are passed to the RNN after some processing, and the RNN produces a meaningful caption. Refer to Fig. 1 to understand the workflow more clearly.

IV. DEEP LEARNING

Deep learning can be defined as an advancement of machine learning which is based on artificial neural networks and contains more than one hidden layer; the "deep" in deep learning denotes the number of layers in the network.

It is a technique in which computers learn to perform tasks that previously only humans could perform, and deep learning algorithms are capable of giving results that outperform humans on some tasks. Algorithms such as the convolutional neural network and the recurrent neural network are applied in different applications like speech recognition, video recognition, and NLP.

In deep learning, very large sets of labelled data are required for training models containing many hidden layers. Large datasets and high computing power are therefore the two most important requirements of deep learning.
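As a toy illustration of the layered structure described above, the following sketch stacks several hidden layers, each feeding its activations to the next. The weights are random and untrained, and all function and variable names are illustrative, not taken from the implementation:

```python
import random

# A minimal sketch of "deep": a network with more than one hidden
# layer, where each layer's activations become the next layer's input.
random.seed(0)

def relu(x):
    """Rectified linear unit: pass positive signals, zero out the rest."""
    return max(0.0, x)

def dense(inputs, n_out):
    """One fully connected layer with random weights and ReLU."""
    return [relu(sum(random.uniform(-1, 1) * v for v in inputs))
            for _ in range(n_out)]

def forward(x, hidden_sizes=(4, 3, 2)):
    """Pass the input through several hidden layers in sequence."""
    activations = x
    for size in hidden_sizes:   # three hidden layers -> a "deep" network
        activations = dense(activations, size)
    return activations

out = forward([0.5, -0.2, 0.1])
print(len(out))  # size of the last layer
```

The point is only the layered structure: adding entries to `hidden_sizes` makes the network deeper without changing any other code.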

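The caption-emitting stage of the Section III workflow can be sketched as a greedy decoding loop. This is only an illustrative toy: the trained CNN+LSTM is replaced by a hypothetical `NEXT_WORD_PROBS` lookup table, and all names here are invented for the sketch:

```python
# Toy sketch of greedy decoding in an encoder-decoder captioner.
# A hypothetical probability table stands in for the trained model.
START, END = "<start>", "<end>"

NEXT_WORD_PROBS = {            # stand-in for the LSTM decoder's output
    START: {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.8, END: 0.2},
    "runs": {END: 1.0},
}

def generate_caption(max_len=10):
    """Greedily pick the most probable next word until <end>."""
    words, current = [], START
    for _ in range(max_len):
        probs = NEXT_WORD_PROBS[current]
        current = max(probs, key=probs.get)  # most probable next word
        if current == END:
            break
        words.append(current)
    return " ".join(words)

print(generate_caption())  # -> "a dog runs"
```

In the real model the distribution over next words would come from the RNN conditioned on the CNN's image features, but the word-by-word loop has this shape.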
978-1-7281-7016-9/20/$31.00 ©2020 IEEE 165

Authorized licensed use limited to: Heriot-Watt University. Downloaded on September 20,2020 at 19:21:18 UTC from IEEE Xplore. Restrictions apply.
V. NEURAL NETWORKS

Human beings are blessed with the ability to think and memorize, and artificial intelligence tries to mimic this behaviour; this is what forms the basis of neural networks. They can be considered a series of algorithms which try to mimic the functioning of the human brain. The human brain is based on networks of neurons which act as messengers of signals; similarly, a neural network contains several layers which try to find the underlying relationships in the input data. Each node in a neural network is called a perceptron. A perceptron is a single-neuron model that was a precursor to larger neural networks: it feeds the signal produced by a multiple linear regression into an activation function that may be nonlinear.

VI. CONVOLUTIONAL NEURAL NETWORK

Fig. 2. Convolutional neural network [8]

The convolutional neural network, also known as CNN or ConvNet, is a kind of artificial neural network most often used for the analysis of images. It can also be used for classification and data-analysis problems. A CNN can be thought of as an artificial neural network that detects patterns and tries to find meaning in them.

More formally, a CNN is a deep learning algorithm which allows the user to input an image, assigns learnable weights and biases to various objects or aspects present in the image, and is hence able to differentiate one object from another.

A. Important Concepts in Convolutional Neural Networks

1) Filters

Fig. 3. Filters [1]

In a convolutional neural network, every layer contains filters. Filters are responsible for finding specific patterns or features in the data given as input; for each layer, the number of filters that the layer should have is specified.

At the start of the network, filters are pretty simple and detect patterns like edges, circles, etc., but as we move to further layers these filters become more advanced and are able to identify complete figures like a mouse or a cat. A filter can be imagined as a matrix with a specified number of rows and columns; random numbers can be used to initialize the matrix blocks.

The input is larger than the filter. A dot product is taken between the filter and a patch of the input image of the same size as the filter, which gives a single value. If the same operation is repeated at multiple positions across the input, we get a 2-D array of filtered values, and this is called a feature map.

2) Padding

Padding is used to protect the length and the breadth of the input image when going from one layer to the next, so a deeper network can be designed with its help. Performance is also said to improve because the information at the borders is kept. Padding is of two types: valid padding and same padding. In valid padding, the feature map produced by a layer is smaller than its input image. In same padding, the feature map is either the same size as the input image or bigger.

3) Activation Function

A node that is kept in the middle or at the finish point of a neural network is called the activation function. An activation function helps us to decide whether to fire the neuron or not. It is a non-linear function applied to the input signal; the transformed signal is then sent to the further layers of neurons, which treat it as input. We have used the ReLU and softmax activation functions.

4) Stride

The number of pixels of the input matrix over which the filter is moved is termed the stride. If the stride is one, the filter is moved one pixel at a time; similarly, if the stride is two or three, the filter is moved two or three pixels at a time. Thus, the stride decides how much our filter slides over the input.

B. Layers in a Convolutional Neural Network

The input image in a convolutional neural network goes through different layers, and each layer is responsible for the functions assigned to it. The layers are discussed as follows:

1) Input Layer

The very first layer present in the convolutional neural network is the input layer, which contains artificial input neurons and is responsible for bringing the initial data into the system. The input layer has a characteristic that the other layers do not have: its neurons are passive in nature. Being the first layer, it does not take information from previous layers; its neurons do not carry weighted inputs, or we can say that the weights are handled differently than in other layers, since the input is arriving for the very first time. In a convolutional neural network the input layer can take any 3-dimensional image, coloured or black and white.

2) Convolutional Layer

The convolutional layer is the most important layer, as most of the computation takes place in it. The purpose of the convolution operation is to extract features as well as minute details from the input image; the convolutional neural network is named after this operation. There can be more than one convolutional layer in a CNN. This layer is responsible for detecting patterns and objects in the input image. It contains a specified number of filters which are applied to the input image.
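The filter operation just described can be sketched in a few lines: slide a filter over the input, take the dot product with each overlapped patch, and collect the results into a feature map. Stride and zero padding appear as parameters; the function and variable names are illustrative:

```python
# Minimal 2-D convolution (really cross-correlation) on nested lists.
def conv2d(image, kernel, stride=1, pad=0):
    if pad:  # zero padding keeps border information ("same"-style)
        w = len(image[0]) + 2 * pad
        image = ([[0] * w for _ in range(pad)]
                 + [[0] * pad + row + [0] * pad for row in image]
                 + [[0] * w for _ in range(pad)])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            # dot product of the filter with the overlapped patch
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

edge = [[1, 0, -1]] * 3      # a simple vertical-edge filter
img = [[10, 10, 0, 0]] * 4   # bright left half, dark right half
print(conv2d(img, edge))     # 4x4 input, 3x3 filter -> 2x2 feature map
```

With stride 2 the 2x2 feature map shrinks to a single value, and with `pad=1` the output of a 3x3 filter keeps the input's size, matching the valid/same distinction above.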

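Two further operations from this section, activation and pooling (the pooling layer is detailed below), sit between convolutions. A sketch, with illustrative names: ReLU is applied elementwise to a feature map and a pooling step then halves its resolution:

```python
# After a convolution: apply an activation, then downsample by pooling.
def relu(x):
    return max(0.0, x)

def max_pool2x2(fmap):
    """Reduce each non-overlapping 2x2 block to its largest value."""
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

fmap = [[-1.0, 3.0, 2.0, -4.0],
        [4.0, -2.0, 1.0, 1.0],
        [0.0, 1.0, -5.0, 6.0],
        [2.0, 2.0, 7.0, -8.0]]

activated = [[relu(v) for v in row] for row in fmap]  # zero out negatives
print(max_pool2x2(activated))  # 4x4 feature map -> 2x2
```

Replacing `max` with an average over the block gives average pooling; the loop structure is identical.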
Each object given in the input image is identified by applying filters through complex mathematical steps that reach an individual value in the output map.

In the convolutional layer, the filter is slid across the whole image, and the dot product of the filter with the part of the image it overlaps gives a scalar value which is stored in a matrix. This output matrix helps us to detect patterns in the image; it is then given as input to the further convolutional layers for the detection of more complex patterns.

3) Pooling Layer

One of the most commonly known limitations of feature maps is that they capture the positions of all the features of the input very precisely. So, if there are small variations in the input image, we get a different feature map; these variations can arise for various reasons such as rotation, cropping, shifting, etc.

Pooling layers are responsible for breaking the feature map into patches and summarizing them. Hence, we obtain an image of lower resolution that contains the important major elements while ignoring minute, useless detail. This technique can be referred to as downsampling. A pooling layer is used between two convolutional layers. The two methods of pooling are:

• Average pooling: the average of the block of the matrix that is overlapped by the filter is returned. Average pooling helps in reducing the dimensions.

• Max pooling: the highest value of the region of the input that is overlapped by the filter is returned. Another function of max pooling is removing noisy activations, hence suppressing noise in the image along with reducing the dimensions. The performance of max pooling is generally better than that of average pooling.

4) Fully Connected Layer

This layer is mainly present at the end of the convolutional network and is also known as the feed-forward or FC layer. A fully connected layer is a dense layer, that is, each node in this layer is connected to each and every node present in the previous layer. The fully connected layer is responsible for the full classification of the features obtained from the previous layers, such as the pooling or convolutional layers.

The input given to the fully connected layer is the output received from the final convolutional layer or the final pooling layer, and this output is flattened. Flattening means converting a matrix into a vector before it enters the fully connected layer. After the image passes through the fully connected layer, the softmax activation function, not the ReLU activation function, is used to obtain the probability of the input being present in some definite class.

5) Output Layer

The outputs received from the fully connected layer are decided on the basis of the highest probability obtained by the FC layer. The outputs are the objects present in our input image, such as dog, cat, mouse, etc.

C. Work Flow of CNN

Fig. 4. Workflow of CNN [1]

All the concepts required in a convolutional neural network have been explained above. Firstly, an input is given to our network through the input layer; the input should be a 3-dimensional image, either coloured or black and white.

The image is then passed to the convolutional layer. In this layer, a specified number of filters are applied to the input image; this process is called convolution. Then an activation function is applied; the ReLU activation function is used here. The output then acts as input to the pooling layer, which helps to decrease the resolution of the image. Then the convolutional layer undergoes the same process again; there can be more than one convolutional layer, and in the further convolutional layers the filters capture more complex patterns such as eyes, faces, and birds. Next, the matrix is flattened and enters the fully connected layer, which helps to classify the objects in the input image into classes; the softmax function is used here to find the probability of the classes. Finally, the output, which consists of the objects present in the image, is obtained in the output layer.

VII. NATURAL LANGUAGE PROCESSING

Natural Language Processing, abbreviated as NLP, is a stream of artificial intelligence which helps us to communicate with computers. Due to NLP, computers are now able to fulfil various tasks which were not possible before: they can read text, hear and understand what we are trying to say, and even interpret the parts that are important. NLP can be used to translate text from one language to another; it is also used in applications such as Grammarly to correct grammatical mistakes, in call centres to respond to customers, and in personal assistant applications such as Alexa and Siri.

In our research paper on image captioning, natural language processing is used after the CNN. Using NLP, we apply algorithms with which we can identify the rules of natural language, so that our language can be converted to a form that can be understood by computers.

Thus, NLP is used here to help us communicate with the computer.

A. Techniques Involved

The two main techniques used in NLP are semantic analysis and syntactic analysis.

1) Syntax Analysis

Here it is identified whether the grammatical rules of the language are being followed or not. Some of the syntax techniques frequently used are parsing, stemming, and lemmatization.

2) Semantic Analysis

Here, algorithms are applied in order to understand the meaning that the language is trying to convey. Some of the techniques involved are word sense disambiguation, named entity recognition, and natural language generation (NLG).

VIII. RECURRENT NEURAL NETWORK

Human beings are able to develop an understanding of the world based on their memory and past experiences. There is some persistence in thoughts which allows humans to use them as memories for reference later. Traditional neural networks are unable to store past experiences; overcoming this is one of the major advantages of recurrent neural networks and the reason they are used.

An RNN contains loops. A loop helps in passing information from one step to the next in the model: the RNN consists of several layers, where each layer passes information to the next and allows information to persist. The functionality of an RNN can be seen in the following example: given a sentence like "The river contains ...", the model should be able to predict the word "water" on its own.

A. Work Flow of RNN

Fig. 5. Workflow of RNN [10]

In the case of an RNN, the persisting information is used in conjunction with the input data for that step, and the output is predicted. As shown in the figure, information from timestamp t-1 is passed along with the input value at timestamp t to predict the output at timestamp t. Backpropagation through time (BPTT) is the algorithm used to train the model. Thus, the functionality of RNNs is required whenever we wish to implement a model that uses persistence.

B. Long Short-Term Memory (LSTM)

LSTM is a type of recurrent neural network which is used in cases where memory must persist for a longer period of time. It is often considered the more enhanced version of the RNN and can perform almost every task that an RNN can. LSTMs can understand long-term dependencies; they are designed specifically to tackle the issue of long-term dependency.

C. Work Flow of LSTM

LSTMs consist of numerous layers, where each of these layers is in the form of a chain of repeating modules. The basic idea behind LSTM is that each layer stores its output as input for the next layer. Each module contains four neural network layers, each interacting in a unique manner.

Fig. 6. Workflow of LSTM [10]

D. A Basic Understanding of LSTMs

The main component of an LSTM is a conveyor-belt-like structure known as "the cell state". It is a straight horizontal line which runs across the entire chain of the structure and has only minimal interactions. Removal of unwanted or unnecessary information from the cell state, or addition of new information to it, is done using "gates". The output received from a sigmoid layer lies between 0 and 1, where 1 represents "permit all information to pass through" and 0 represents "dump all information from the cell state". There are three gates to protect and control the cell state. The very first layer that the input data goes through in an LSTM is the "forget gate layer"; in this layer, the decision on which part of the data to discard is taken by the sigmoid layer.

Fig. 7. Forget gate layer [10]

The second layer that the input data goes through in an LSTM is the "input gate layer". In this layer, we decide which new information is to be added to the cell state. There are two components of this layer: the first is a sigmoid function and the second is a tanh layer.

Fig. 8. Input gate layer [10]

IX. APPLICATIONS OF IMAGE CAPTIONING

• Helps in automating the task of labelling images.

• Can be useful for social media sites such as Facebook, which can make use of these

automatically generated captions to infer data about their users solely on the basis of images.

• In web development, makes websites dynamic: images can simply be added without worrying about describing them.

• Can be of aid to visually impaired people, who can simply read the text in a much larger font.

• Helps in organizing files without dwelling on the task of captioning the images present in the files.

• Can help Google Photos organize images based on the objects present in them.

X. CONCLUSION

The image caption generator is thus an effective tool which can be used for multiple purposes. It helps in automating the task of generating labels for images in an organized manner and thus aids in making lives easier. It can be used to describe images to people who are blind or have low vision and who rely on sounds and text to understand a scene. In web development, it is good practice to provide a description for any image that appears on a page, so that the image can be read or heard as opposed to just seen.

REFERENCES

[1] Md. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A Comprehensive Survey of Deep Learning for Image Captioning", ACM Comput. Surv. 0, 0, Article 0 (October 2018), 36 pages.

[2] J. Brownlee, "A Gentle Introduction to Deep Learning Caption Generation Models", blog post, November 22, 2017.

[3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[4] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher, "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing", CoRR, abs/1506.07285, 2015.

[5] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image Captioning with Semantic Attention", CoRR, abs/1603.03925, 2016.

[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every Picture Tells a Story: Generating Sentences from Images", in Proceedings of the 11th European Conference on Computer Vision: Part IV (ECCV'10), pp. 15–29, Berlin, Heidelberg, 2010, Springer-Verlag.

[7] B. Hu, Z. Lu, H. Li, and Q. Chen, "Convolutional Neural Network Architectures for Matching Natural Language Sentences", in Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, pp. 2042–2050.

[8] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images", in International Conference on Computer Vision, 2015.

[9] X. Chen and C. Lawrence Zitnick, "Mind's Eye: A Recurrent Visual Representation for Image Caption Generation", CVPR, 2015.

[10] C. Olah, "Understanding LSTM Networks", online blog, 2015.

