
Image caption generation using Deep Learning

By
Sourav Kumar Saikia - (17-16-022)
Bishal Nath - (17-16-049)
Bibhuti Baishya - (17-16-052)

Under the guidance of


Dr. Manas Kumar Bera

In Partial Fulfillment of the Fourth year UG Project Work


Department of Electronics & Instrumentation Engineering
NIT Silchar, Assam
Table of Contents

1. Introduction
2. Literature Survey/Background Study
3. Motivation and Objective
4. Methodology
5. Work Done So Far
6. Work To Be Done
7. Time frame of the project
8. References

Introduction

• The problem of generating natural language descriptions from visual data has long been
studied in computer vision, but mainly for video (e.g., subtitles for YouTube videos).
• Leveraging recent advances in computer vision, such as recognizing objects together with
their attributes and locations, allows us to drive natural language generation systems to
produce a description for a given image.
• We propose a neural and probabilistic framework to generate descriptions from images.
• The generative model is based on deep learning, combining recent advances in computer
vision and natural language processing to generate an English sentence describing the
image.

Literature Survey/Background Study
• We have studied the following research papers.
| Name | Author(s) | Knowledge gained |
| --- | --- | --- |
| Show and Tell: A Neural Image Caption Generator | Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan | Understanding the structure of an image caption generation model |
| Accelerating Very Deep Convolutional Networks for Classification and Detection | Xiangyu Zhang, Jianhua Zou, Kaiming He, Jian Sun | Understanding of Convolutional Neural Networks |
| What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? | Marc Tanti, Albert Gatt, Kenneth P. Camilleri | Understanding of different architectures that can be used for caption generation |
| Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network | Alex Sherstinsky | Basic idea about RNN and LSTM |
| Context based text-generation using LSTM networks | Sivasurya Santhanam | Understanding the text generation model with LSTM |

Table 1: Details of the research papers we studied


Literature Survey/Background Study
• We have also consulted the following websites to gain a basic understanding of neural
networks.
| Website | Author | Knowledge gained |
| --- | --- | --- |
| https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 | Sumit Saha | Understanding of Convolutional Neural Networks |
| https://keras.io/examples/vision/image_classification_from_scratch/ | Keras Documentation | Image classification example using Keras |
| https://neurohive.io/en/popular-networks/vgg16/ | Muneeb ul Hassan | Understanding the CNN model named VGG16 used for image classification |
| http://introtodeeplearning.com/ | MIT | Understanding of Neural Networks |

Table 2: Details of the websites we have visited

Motivation and Objective

• Caption generation is a challenging artificial intelligence problem where a textual
description must be generated for a given photograph.
• Constructing proper English sentences will have a great impact by helping visually
impaired people better understand today's digital world.
• These methods result in a single end-to-end model, defined to predict a caption given a
photo, instead of requiring sophisticated data preparation or a pipeline of specifically
designed models.

• To prepare photo and text data for training a deep learning model.
• To design and train a deep learning caption generation model.
• To evaluate a trained caption generation model and use it to caption entirely new
photographs.

Methodology
• We have selected the “Flickr8k” dataset, which consists of 8,000 images, each paired
with 5 different captions.
• The image data is preprocessed using the Keras API and fed into VGG16 to extract
features, which are stored locally, as sketched below.
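A minimal sketch of this feature extraction step, assuming a TensorFlow/Keras setup; the directory path "Flickr8k_Dataset" and output file "features.pkl" are placeholder names, not the project's actual files:

```python
import os
import pickle
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Drop VGG16's final classification layer so the model outputs the
# 4096-dimensional fc2 feature vector instead of class probabilities.
base = VGG16()
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

features = {}
for name in os.listdir("Flickr8k_Dataset"):            # placeholder path
    img = load_img(os.path.join("Flickr8k_Dataset", name),
                   target_size=(224, 224))             # VGG16 input size
    x = img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])           # add batch axis, normalize
    features[name.split(".")[0]] = extractor.predict(x, verbose=0)

# Store the extracted features locally for reuse during training.
with open("features.pkl", "wb") as f:
    pickle.dump(features, f)
```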

Fig 1: The complete architecture of VGG16, as in the paper “Very Deep Convolutional
Networks for Large-Scale Image Recognition”
Methodology
• The text data is cleaned to remove punctuation, numbers, and articles.
• The dataset is divided into training and test sets for further processing.
• We will feed the text data into an LSTM to extract the text features; a cleaning and
tokenization sketch is given below.
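A minimal sketch of the cleaning and tokenization step, assuming Keras' `Tokenizer`; the toy `captions` dict and the `startseq`/`endseq` delimiter tokens are illustrative assumptions, not the project's exact preprocessing:

```python
import string
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy example of the raw data: image id -> list of captions.
captions = {"img1": ["A dog runs through the 2 green fields."]}

def clean(caption):
    # Lower-case, strip punctuation, drop numeric tokens and articles.
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha() and w not in ("a", "an", "the")]
    # Delimiters mark where generation starts and stops.
    return "startseq " + " ".join(words) + " endseq"

cleaned = {i: [clean(c) for c in caps] for i, caps in captions.items()}

# Fit a tokenizer so every word is mapped to an integer index.
all_caps = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_caps)
vocab_size = len(tokenizer.word_index) + 1             # +1 for the padding index
max_length = max(len(c.split()) for c in all_caps)     # longest caption
```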

Fig 2: Basic representation of RNN and Feed-Forward Neural Network

Methodology

Fig 3: Vanilla RNN structure
Fig 4: Structure of LSTM

Methodology

• Finally, both the image and text features are fed into another multimodal network to
generate the final caption; a sketch of this merge architecture is given below.
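A minimal sketch of the multimodal merge network, assuming the 4096-dim VGG16 features and the 256-unit LSTM from the results table; the placeholder `vocab_size` and `max_length` values and the exact layer sizes are assumptions, not the project's final configuration:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

vocab_size, max_length = 7579, 34          # placeholder values for Flickr8k

# Image branch: compress the stored VGG16 feature vector.
img_in = Input(shape=(4096,))
img_x = Dropout(0.5)(img_in)
img_x = Dense(256, activation="relu")(img_x)

# Text branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(max_length,))
txt_x = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_x = Dropout(0.5)(txt_x)
txt_x = LSTM(256)(txt_x)

# Decoder: merge both branches and predict the next word of the caption.
merged = add([img_x, txt_x])
out = Dense(256, activation="relu")(merged)
out = Dense(vocab_size, activation="softmax")(out)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

The model is trained to predict the next word from an image plus a partial caption, so at inference time the caption is built up one word at a time.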

Fig 5: Project architecture


Methodology

• We will evaluate the performance of the model using the BLEU (Bilingual Evaluation
Understudy) score; an evaluation sketch is given below.
• We will improve the model by varying different parameters (e.g., dropout, optimizer,
activation function, number of dense layers).
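A minimal sketch of BLEU-1 evaluation with NLTK's `corpus_bleu`, assuming the `model`, `tokenizer`, `features`, and `max_length` from the earlier sketches; `test_captions` and the greedy `generate_caption` helper are hypothetical names introduced here for illustration:

```python
import numpy as np
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo, max_length):
    """Greedy decoding: repeatedly append the most likely next word."""
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = int(np.argmax(model.predict([photo, seq], verbose=0)))
        word = tokenizer.index_word.get(yhat)
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()

references, hypotheses = [], []
for img_id, caps in test_captions.items():             # hypothetical test split
    predicted = generate_caption(model, tokenizer, features[img_id], max_length)
    references.append([c.split() for c in caps])        # 5 reference captions
    hypotheses.append(predicted.split())

# BLEU-1 scores unigram overlap only, hence weights (1, 0, 0, 0).
print("BLEU-1: %.3f" % corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0)))
```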

Results
| Batch size | No. of epochs | Image classifier | LSTM units | BLEU-1 |
| --- | --- | --- | --- | --- |
| 32 | 7 | VGG16 | 128 | 0.472 |
| 64 | 13 | VGG16 | 256 | 0.573 |
| 64 | 7 | VGG16 | 256 | 0.591 |
| 64 | 8 | VGG16 | 256 | 0.578 |

Table 3: BLEU-1 scores for different training configurations

Software used

• Language: Python 3.6
• IDE: Spyder
• Libraries: Keras with TensorFlow backend
• Dataset: Flickr8k (a collection of 8,000 images, each labelled with 5 captions)
Proposed Improvements

• Increase the amount of input data by using a larger dataset such as Flickr30K and check
the output.
• Change the CNN from VGG16 to InceptionV3 and try to improve the BLEU score further,
as sketched below.
• Change the dropout amount, activation function, etc., and check whether the BLEU score
can be improved or not.
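A minimal sketch of swapping the feature extractor to InceptionV3; the different input size and feature dimension noted in the comments are properties of the Keras model, while the overall swap is a proposed change, not a project result:

```python
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# InceptionV3's penultimate layer (global average pooling) yields a
# 2048-dim feature vector, versus 4096 for VGG16's fc2 layer.
base = InceptionV3(weights="imagenet")
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

# Images must now be resized to (299, 299) before inception_v3.preprocess_input,
# and the decoder's image branch becomes Input(shape=(2048,)).
```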

Timeframe of the project

❖ Ideation of this project started in September 2020, and we plan to complete it by April
2021. The detailed timeframe is laid out in the timeline below.

• September 2020: Study of research papers; download of software and datasets
• November 2020: Preprocessing of the image and text data
• December 2020: Coding, debugging, and training the model with different parameters
• April 2021: Optimization of the model and preparation of the conference paper
References
• Journal:
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, “Show and Tell: A Neural Image
Caption Generator”, 9 pages, 20 April 2015.
• Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun, “Accelerating Very Deep
Convolutional Networks for Classification and Detection”, 14 pages, 18 November 2015.
• Marc Tanti, Albert Gatt, Kenneth P. Camilleri, “What is the Role of Recurrent Neural
Networks (RNNs) in an Image Caption Generator?”, 10 pages, 25 August 2017.
• Alex Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term
Memory (LSTM) Network”, 43 pages, 31 May 2020.
• Sivasurya Santhanam, “Context based text-generation using LSTM networks”, 10 pages,
30 April 2020.

• Dataset Used:
• Flickr8k: https://www.kaggle.com/shadabhussain/flickr8k
• Flickr30K: https://www.kaggle.com/hsankesara/flickr-image-dataset

References
• Webpage:
• Muneeb ul Hassan, “VGG16 - Convolutional Network for Classification and Detection”, neurohive,
20 November 2018, https://neurohive.io/en/popular-networks/vgg16/ [Accessed: 25-10-2020]
• Sumit Saha, “A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way”,
towardsdatascience, 15 December 2018,
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 [Accessed: 01-11-2020]
• MIT, “Introduction to Deep Learning”, http://introtodeeplearning.com/ [Accessed: 01-11-2020]
• M. Jishan, “ImageToText: Image Caption Generation Using Hybrid Recurrent Neural Network”,
academia.edu, https://www.academia.edu/40860840/ImageToText_Image_Caption_Generation_Using_Hybrid_Recurrent_Neural_Network [Accessed: 15-11-2020]
• Pragati Baheti, “Introduction to Multimodal Deep Learning”, heartbeat.fritz.ai,
https://heartbeat.fritz.ai/introduction-to-multimodal-deep-learning-630b259f9291 [Accessed: 17-11-2020]
• Richard Wanjohi, “Time Series Forecasting using LSTM in R”, rwanjohi.rbind.io,
https://rwanjohi.rbind.io/2018/04/05/time-series-forecasting-using-lstm-in-r/ [Accessed: 22-11-2020]
