By
Sourav Kumar Saikia - (17-16-022)
Bishal Nath - (17-16-049)
Bibhuti Baishya - (17-16-052)
1. Introduction
2. Literature Survey/Background Study
3. Motivation and Objective
4. Methodology
5. Work Done So Far
6. Work To Be Done
7. Time frame of the project
8. References
Introduction
• The problem of generating natural language descriptions from visual data has long been
studied in computer vision, but mainly for video (e.g., generating subtitles for YouTube videos).
• Leveraging recent advances in computer vision in recognizing objects, their attributes,
and their locations allows us to drive natural language generation systems that produce a
description for a given image.
• We propose a neural and probabilistic framework to generate descriptions from images.
• The generative model is based on deep learning and combines recent advances in
computer vision and natural language processing to generate an English sentence
describing the image.
Literature Survey/Background Study
• We have studied the following journals:

Name | Author(s) | Knowledge gained
"Show and Tell: A Neural Image Caption Generator" | Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan | Understanding the structure of an image caption generation model
"Accelerating Very Deep Convolutional Networks for Classification and Detection" | Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun | Understanding of convolutional neural networks
"What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator" | Marc Tanti, Albert Gatt, Kenneth P. Camilleri | Understanding of different architectures that can be used for caption generation
"Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network" | Alex Sherstinsky | Basic idea about RNNs and LSTMs
"Context based text-generation using LSTM networks" | Sivasurya Santhanam | Understanding the text generation model with LSTMs
Motivation and Objective
• To prepare photo and text data for training a deep learning model.
• To design and train a deep learning caption generation model.
• To evaluate a trained caption generation model and use it to caption entirely new
photographs.
Methodology
• We have selected the "Flickr8k" dataset, which consists of 8,000 images, each paired
with 5 different captions.
• The image data is preprocessed using the Keras API and fed into VGG16 to extract features,
which are stored locally.
Fig 1: The complete architecture of VGG16 as in the paper “Very Deep Convolutional
Networks for Large-Scale Image Recognition”
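This feature-extraction step can be sketched as follows. This is a minimal illustration assuming TensorFlow's Keras API; in practice the model would be loaded with weights='imagenet' (here weights=None keeps the sketch offline), and the image would come from disk via load_img rather than an in-memory array.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Feature extractor: VGG16 without its final classification layer, so each
# image maps to the 4096-dimensional output of the fc2 layer.
base = VGG16(weights=None)  # weights='imagenet' in practice
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(image_array):
    """image_array: a (224, 224, 3) RGB array; returns a (4096,) feature vector."""
    x = preprocess_input(np.expand_dims(image_array.astype("float32"), axis=0))
    return extractor.predict(x, verbose=0)[0]
```

The extracted vectors can then be saved locally (e.g., with pickle or numpy) keyed by image filename, so the CNN runs only once per image.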
• The text data is cleaned to remove punctuation, numbers, and articles.
• The dataset is divided into training and test sets for further processing.
• We will feed the text data into an LSTM and extract the text features.
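The caption-cleaning step described above can be sketched with standard-library Python alone (a minimal illustration; the exact cleaning rules in the project may differ):

```python
import string

# Words dropped as articles during cleaning.
ARTICLES = {"a", "an", "the"}

def clean_caption(caption):
    """Lowercase a caption, strip punctuation, drop numbers and articles."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    return " ".join(w for w in words if w.isalpha() and w not in ARTICLES)
```

For example, clean_caption("A dog is running on the grass!") returns "dog is running on grass".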
• Finally, both the image and text features are fed into another multimodal network to generate
the final caption.
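A common way to realize this multimodal step is the "merge" architecture, sketched below with Keras. The vocabulary size and maximum caption length are illustrative placeholders, not the project's actual settings, and the layer sizes are one reasonable choice among many.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 5000   # illustrative values, not the actual project settings
MAX_LENGTH = 34

# Image branch: compress the 4096-d VGG16 feature vector to 256 dimensions.
img_in = Input(shape=(4096,))
img_feat = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(MAX_LENGTH,))
txt_emb = Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_in)
txt_feat = LSTM(256)(Dropout(0.5)(txt_emb))

# Multimodal merge: combine the two 256-d vectors and predict the next word.
merged = Dense(256, activation="relu")(add([img_feat, txt_feat]))
out = Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the model is called repeatedly, feeding back each predicted word, until an end-of-sequence token is produced.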
Results
Batch size | No. of epochs | Image classifier | LSTM | BLEU-1
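BLEU-1, the metric reported here, is the unigram-level BLEU score. A minimal self-contained sketch, assuming a single reference caption per candidate (real evaluations typically use a library such as NLTK and all 5 references):

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """BLEU-1 for one candidate against one reference (both token lists):
    clipped unigram precision multiplied by the brevity penalty."""
    if not candidate:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    # Each candidate word counts at most as often as it appears in the reference.
    clipped = sum(min(count, ref[word]) for word, count in cand.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision
```

For example, bleu1("a dog runs".split(), "a dog runs fast".split()) gives exp(-1/3) ≈ 0.717: every candidate word matches, but the brevity penalty applies because the candidate is shorter than the reference.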
Software used
Proposed Improvements
• Increase the amount of input data by using a larger dataset such as Flickr30k, and check the output.
• Change the CNN from VGG16 to InceptionV3 and try to improve the BLEU score
further.
• Change the dropout amount, activation function, etc., and check whether the BLEU score
can be improved or not.
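Swapping the CNN mainly changes the extractor and the feature dimensionality: InceptionV3 expects 299x299 inputs and its pooled features are 2048-dimensional (vs. 4096 from VGG16's fc2), so the image branch of the caption model must be resized to match. A hedged sketch, again with weights=None to stay offline (weights='imagenet' in practice):

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.models import Model

# InceptionV3 feature extractor: drop the classification head and keep the
# 2048-d global-average-pooled vector.
base = InceptionV3(weights=None)  # weights='imagenet' in practice
extractor = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_inception_features(image_array):
    """image_array: a (299, 299, 3) RGB array; returns a (2048,) feature vector."""
    x = preprocess_input(np.expand_dims(image_array.astype("float32"), axis=0))
    return extractor.predict(x, verbose=0)[0]
```

Note that each CNN family has its own preprocess_input (InceptionV3 scales pixels to [-1, 1]), so the preprocessing must be swapped along with the network.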
Timeframe of the project
❖ We started the ideation of this project in September 2020 and plan to complete it by April
2021. The detailed timeframe is shown in the timeline below.
September 2020: Study of research papers; download of software and datasets.
November 2020: Preprocessing the image and text data.
December 2020: Coding, debugging, and training the model with different parameters.
April 2021: Optimization of the model and preparation of the conference paper.
References
• Journal:
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. "Show and Tell: A Neural Image
Caption Generator", 9 pages, 20 April 2015.
• Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. "Accelerating Very Deep
Convolutional Networks for Classification and Detection", 14 pages, 18 November 2015.
• Marc Tanti, Albert Gatt, Kenneth P. Camilleri. "What is the Role of Recurrent Neural
Networks (RNNs) in an Image Caption Generator", 10 pages, 25 August 2017.
• Alex Sherstinsky. "Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term
Memory (LSTM) Network", 43 pages, 31 May 2020.
• Sivasurya Santhanam. "Context based text-generation using LSTM networks", 10 pages,
30 April 2020.
• Dataset Used:
• Flickr8k: https://www.kaggle.com/shadabhussain/flickr8k
• Flickr30k: https://www.kaggle.com/hsankesara/flickr-image-dataset
• Webpage:
• Muneeb ul Hassan, "VGG16 - Convolutional Network for Classification and Detection", neurohive,
20 November 2018, https://neurohive.io/en/popular-networks/vgg16/ [Accessed 25-10-2020]
• Sumit Saha, "A Comprehensive Guide to Convolutional Neural Networks - the ELI5 way",
towardsdatascience, 15 December 2018,
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 [Accessed 01-11-2020]
• MIT, "Introduction to Deep Learning", http://introtodeeplearning.com/ [Accessed 01-11-2020]
• M. Jishan, "ImageToText: Image Caption Generation Using Hybrid Recurrent Neural Network",
academia.edu,
https://www.academia.edu/40860840/ImageToText_Image_Caption_Generation_Using_Hybrid_Recurrent_Neural_Network [Accessed 15-11-2020]
• Pragati Baheti, "Introduction to Multimodal Deep Learning", heartbeat.fritz.ai,
https://heartbeat.fritz.ai/introduction-to-multimodal-deep-learning-630b259f9291 [Accessed 17-11-2020]
• Richard Wanjohi, "Time Series Forecasting using LSTM in R", rwanjohi.rbind.io,
https://rwanjohi.rbind.io/2018/04/05/time-series-forecasting-using-lstm-in-r/ [Accessed 22-11-2020]