
Image caption generation using Deep Learning

By
Sourav Kumar Saikia - (17-16-022)
Bishal Nath - (17-16-049)
Bibhuti Baishya - (17-16-052)

Under the guidance of


Dr. Manas Kumar Bera

In Partial Fulfillment of the Fourth year UG Project Work


Department of Electronics & Instrumentation Engineering
NIT Silchar, Assam
Table of Contents

1. Introduction
2. Literature Survey/Background Study
3. Motivation and Objective
4. Methodology
5. Work Done So Far
6. Work To Be Done
7. Time frame of the project
8. References

Introduction

• The problem of generating natural language descriptions from visual data has long been
studied in computer vision, but mainly for video (e.g., subtitles for YouTube videos).
• Leveraging recent advances in computer vision, such as recognizing objects together with
their attributes and locations, allows us to drive natural language generation systems to
produce a description for a given image.
• We propose a neural and probabilistic framework to generate descriptions from images.
• The generative model is based on deep learning, combining recent advances in computer
vision and natural language processing to generate an English sentence describing the
image.

Literature Survey/Background Study
• We have studied the following research papers.
| Name | Author(s) | Knowledge gained |
| --- | --- | --- |
| Show and Tell: A Neural Image Caption Generator | Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan | Understanding the structure of an image caption generation model |
| Accelerating Very Deep Convolutional Networks for Classification and Detection | Xiangyu Zhang, Jianhua Zou, Kaiming He, Jian Sun | Understanding of Convolutional Neural Networks |
| What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? | Marc Tanti, Albert Gatt, Kenneth P. Camilleri | Understanding of different architectures that can be used for caption generation |
| Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network | Alex Sherstinsky | Basic idea about RNN and LSTM |
| Context based text-generation using LSTM networks | Sivasurya Santhanam | Understanding the text generation model with LSTM |

Table 1: Details of the research papers we studied


Literature Survey/Background Study
• We have also consulted the following websites to gain a basic understanding of neural
networks.
| Website | Author | Knowledge gained |
| --- | --- | --- |
| https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 | Sumit Saha | Understanding of Convolutional Neural Networks |
| https://keras.io/examples/vision/image_classification_from_scratch/ | Keras Documentation | Image classification example using Keras |
| https://neurohive.io/en/popular-networks/vgg16/ | Muneeb ul Hassan | Understanding the CNN model named VGG16 used for image classification |
| http://introtodeeplearning.com/ | MIT | Understanding of Neural Networks |

Table 2: Details of the websites we have visited

Motivation and Objective

• Caption generation is a challenging artificial intelligence problem where a textual
description must be generated for a given photograph.
• Constructing proper English sentences will have a great impact by helping visually
impaired people better understand today's digital world.
• These methods result in a single end-to-end model, defined to predict a caption given a
photo, instead of requiring sophisticated data preparation or a pipeline of specifically
designed models.

• To prepare photo and text data for training a deep learning model.
• To design and train a deep learning caption generation model.
• To evaluate a trained caption generation model and use it to caption entirely new
photographs.

Methodology
• We have selected the “Flickr8k” dataset, which consists of 8,000 images, each paired
with 5 different captions.
• The image data is preprocessed using the Keras API and fed into VGG16 to extract
features, which are stored locally, as sketched below.
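A minimal sketch of this feature extraction step, assuming a TensorFlow/Keras setup; the directory path "Flickr8k_Dataset" and output file "features.pkl" are placeholder names, not the project's actual files:

```python
import os
import pickle
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Drop VGG16's final classification layer so the model outputs the
# 4096-dimensional fc2 feature vector instead of class probabilities.
base = VGG16()
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

features = {}
for name in os.listdir("Flickr8k_Dataset"):            # placeholder path
    img = load_img(os.path.join("Flickr8k_Dataset", name),
                   target_size=(224, 224))             # VGG16 input size
    x = img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])           # add batch axis, normalize
    features[name.split(".")[0]] = extractor.predict(x, verbose=0)

# Store the extracted features locally for reuse during training.
with open("features.pkl", "wb") as f:
    pickle.dump(features, f)
```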

Fig 1: The complete architecture of VGG16, as in the paper “Very Deep Convolutional
Networks for Large-Scale Image Recognition”
Methodology
• The text data is cleaned to remove punctuation, numbers, and articles.
• The dataset is divided into training and test sets for further processing.
• We will feed the text data into an LSTM to extract the text features; a cleaning and
tokenization sketch is given below.
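A minimal sketch of the cleaning and tokenization step, assuming Keras' `Tokenizer`; the toy `captions` dict and the `startseq`/`endseq` delimiter tokens are illustrative assumptions, not the project's exact preprocessing:

```python
import string
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy example of the raw data: image id -> list of captions.
captions = {"img1": ["A dog runs through the 2 green fields."]}

def clean(caption):
    # Lower-case, strip punctuation, drop numeric tokens and articles.
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha() and w not in ("a", "an", "the")]
    # Delimiters mark where generation starts and stops.
    return "startseq " + " ".join(words) + " endseq"

cleaned = {i: [clean(c) for c in caps] for i, caps in captions.items()}

# Fit a tokenizer so every word is mapped to an integer index.
all_caps = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_caps)
vocab_size = len(tokenizer.word_index) + 1             # +1 for the padding index
max_length = max(len(c.split()) for c in all_caps)     # longest caption
```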

Fig 2: Basic representation of RNN and Feed-Forward Neural Network

Methodology

Fig 3: Vanilla RNN structure
Fig 4: Structure of LSTM

Methodology

• Finally, both the image and text features are fed into another multimodal network to
generate the final caption; a sketch of this merge architecture is given below.
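A minimal sketch of the multimodal merge network, assuming the 4096-dim VGG16 features and the 256-unit LSTM from the results table; the placeholder `vocab_size` and `max_length` values and the exact layer sizes are assumptions, not the project's final configuration:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

vocab_size, max_length = 7579, 34          # placeholder values for Flickr8k

# Image branch: compress the stored VGG16 feature vector.
img_in = Input(shape=(4096,))
img_x = Dropout(0.5)(img_in)
img_x = Dense(256, activation="relu")(img_x)

# Text branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(max_length,))
txt_x = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_x = Dropout(0.5)(txt_x)
txt_x = LSTM(256)(txt_x)

# Decoder: merge both branches and predict the next word of the caption.
merged = add([img_x, txt_x])
out = Dense(256, activation="relu")(merged)
out = Dense(vocab_size, activation="softmax")(out)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

The model is trained to predict the next word from an image plus a partial caption, so at inference time the caption is built up one word at a time.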

Fig 5: Project architecture


Methodology

• We will evaluate the performance of the model using the BLEU (Bilingual Evaluation
Understudy) score; an evaluation sketch is given below.
• We will improve the model by varying different parameters (e.g., dropout, optimizer,
activation function, number of dense layers).
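A minimal sketch of BLEU-1 evaluation with NLTK's `corpus_bleu`, assuming the `model`, `tokenizer`, `features`, and `max_length` from the earlier sketches; `test_captions` and the greedy `generate_caption` helper are hypothetical names introduced here for illustration:

```python
import numpy as np
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo, max_length):
    """Greedy decoding: repeatedly append the most likely next word."""
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = int(np.argmax(model.predict([photo, seq], verbose=0)))
        word = tokenizer.index_word.get(yhat)
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()

references, hypotheses = [], []
for img_id, caps in test_captions.items():             # hypothetical test split
    predicted = generate_caption(model, tokenizer, features[img_id], max_length)
    references.append([c.split() for c in caps])        # 5 reference captions
    hypotheses.append(predicted.split())

# BLEU-1 scores unigram overlap only, hence weights (1, 0, 0, 0).
print("BLEU-1: %.3f" % corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0)))
```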

Results
| Batch size | No. of epochs | Image classifier | LSTM units | BLEU-1 |
| --- | --- | --- | --- | --- |
| 32 | 7 | VGG16 | 128 | 0.472 |
| 64 | 13 | VGG16 | 256 | 0.573 |
| 64 | 7 | VGG16 | 256 | 0.591 |
| 64 | 8 | VGG16 | 256 | 0.578 |

Table 3: BLEU-1 scores for different training configurations

Software used

• Language: Python 3.6
• IDE: Spyder
• Libraries: Keras with TensorFlow backend
• Dataset: Flickr8k (a collection of 8,000 images, each labelled with 5 captions)
Proposed Improvements

• Increase the amount of input data by using a larger dataset such as Flickr30K and check
the output.
• Change the CNN from VGG16 to InceptionV3 and try to improve the BLEU score further,
as sketched below.
• Change the dropout amount, activation function, etc., and check whether the BLEU score
can be improved or not.
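A minimal sketch of swapping the feature extractor to InceptionV3; the different input size and feature dimension noted in the comments are properties of the Keras model, while the overall swap is a proposed change, not a project result:

```python
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.models import Model

# InceptionV3's penultimate layer (global average pooling) yields a
# 2048-dim feature vector, versus 4096 for VGG16's fc2 layer.
base = InceptionV3(weights="imagenet")
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

# Images must now be resized to (299, 299) before inception_v3.preprocess_input,
# and the decoder's image branch becomes Input(shape=(2048,)).
```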

Timeframe of the project

❖ Ideation of this project started in September 2020, and we plan to complete it by April
2021. The detailed timeframe is laid out in the timeline below.

• September 2020: Study of research papers; download of software and datasets
• November 2020: Preprocessing of the image and text data
• December 2020: Coding, debugging, and training the model with different parameters
• April 2021: Optimization of the model and preparation of the conference paper
References
• Journal:
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, “Show and Tell: A Neural Image
Caption Generator”, 9 pages, 20 April 2015.
• Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun, “Accelerating Very Deep
Convolutional Networks for Classification and Detection”, 14 pages, 18 November 2015.
• Marc Tanti, Albert Gatt, Kenneth P. Camilleri, “What is the Role of Recurrent Neural
Networks (RNNs) in an Image Caption Generator?”, 10 pages, 25 August 2017.
• Alex Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term
Memory (LSTM) Network”, 43 pages, 31 May 2020.
• Sivasurya Santhanam, “Context based text-generation using LSTM networks”, 10 pages,
30 April 2020.

• Dataset Used:
• Flickr8k: https://www.kaggle.com/shadabhussain/flickr8k
• Flickr30K: https://www.kaggle.com/hsankesara/flickr-image-dataset

References
• Webpage:
• Muneeb ul Hassan, “VGG16 - Convolutional Network for Classification and Detection”, neurohive,
20 November 2018, https://neurohive.io/en/popular-networks/vgg16/ [Accessed: 25-10-2020]
• Sumit Saha, “A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way”,
towardsdatascience, 15 December 2018,
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 [Accessed: 01-11-2020]
• MIT, “Introduction to Deep Learning”, http://introtodeeplearning.com/ [Accessed: 01-11-2020]
• M. Jishan, “ImageToText: Image Caption Generation Using Hybrid Recurrent Neural Network”,
academia.edu, https://www.academia.edu/40860840/ImageToText_Image_Caption_Generation_Using_Hybrid_Recurrent_Neural_Network [Accessed: 15-11-2020]
• Pragati Baheti, “Introduction to Multimodal Deep Learning”, heartbeat.fritz.ai,
https://heartbeat.fritz.ai/introduction-to-multimodal-deep-learning-630b259f9291 [Accessed: 17-11-2020]
• Richard Wanjohi, “Time Series Forecasting using LSTM in R”, rwanjohi.rbind.io,
https://rwanjohi.rbind.io/2018/04/05/time-series-forecasting-using-lstm-in-r/ [Accessed: 22-11-2020]
