Submitted by
MAGESHKUMAR S (212221060155)
MOKESH M (212221060176)
SANJAY K (212221060240)
degree of
BACHELOR OF ENGINEERING
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
SAVEETHA ENGINEERING COLLEGE (AUTONOMOUS),
CHENNAI
ANNA UNIVERSITY, CHENNAI – 600 025
BONAFIDE CERTIFICATE
MINI PROJECT APPROVAL SHEET
ABSTRACT
ACKNOWLEDGMENT
We thank all the teaching and non-teaching faculty members of the
Department of Electronics and Communication for their passionate
support, for helping us identify our mistakes, and for the
appreciation they gave us. We heartily thank our library staff and the
management for their extensive support in providing the resources and
information that helped us complete the project successfully. We would
also like to record our deepest gratitude to our parents, whose constant
encouragement and support greatly motivated us to complete our
project work.
TABLE OF CONTENTS
ABSTRACT iv
LIST OF ABBREVIATIONS x
1. INTRODUCTION 1
1.1 OVERVIEW 1
1.2 LSTM 1
1.3 CNN MODEL 2
1.3.1 ARCHITECTURE OF CNN 3
1.4 RNN MODEL 3
1.5 ARCHITECTURE OF LSTM 4
1.6 BACKEND 4
2. LITERATURE SURVEY 5
3. REQUIREMENT ANALYSIS 7
6. PERFORMANCE EVALUATION 17
6.1 TESTING THE MODEL 17
6.2 RESULT 17
APPENDIX PAGE NO
APPENDIX 1 SOURCE CODE 19
APPENDIX 2 SCREENSHOT 20
LIST OF FIGURES
4.4.3 ARCHITECTURE DIAGRAM FOR IMAGE CAPTIONING 12
6.2 RESULT 17
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Image captioning, the task of automatically generating a natural-language
description of the content observed in an image, is an important part
of scene understanding; it combines knowledge from computer
vision and natural language processing.
1.2 LSTM
LSTM stands for Long Short-Term Memory, a type of RNN well
suited to sequence prediction problems. Based on the previous text,
an LSTM can predict what the next word will be. It carries relevant
information through the processing of the inputs and, with a forget
gate, discards non-relevant information. We use an LSTM with soft
attention as the decoder, which selectively focuses attention over a
certain part of the image to predict the next words. The LSTM generates
the description from the input image by processing already extracted
feature maps; it serves as the language model of the decoder,
decoding the feature vector into a sentence.
Special start and end tokens mark the beginning and end of each
caption; the network generating the captions has to learn the words
between these tokens. The LSTM model learns to predict the next word
St of the caption based on the vector of visual features and the previous
t-1 words, and provides the next state vector ht along with the next word.
The context vector zt is a concatenation of the feature vector and the
one-hot vector of the word representation.
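The report does not reproduce the decoder's code, so the following is only a minimal sketch of how such a merge-style CNN-feature plus LSTM decoder is commonly wired in Keras. The 2048-dimensional image feature (matching the Xception features extracted later), the 256-unit layer sizes, and the names vocab_size and max_length are assumptions for illustration.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Image-feature branch: a 2048-d feature vector squeezed to 256 units.
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Sequence branch: the previous words, embedded and run through an LSTM.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: merge both branches and predict the next word St.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model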
1.3 CNN MODEL
Convolutional Neural Networks are specialized deep neural networks that
process data with a 2D-matrix-like input shape. Images are easily
represented as 2D matrices, which makes CNNs very useful for working
with images. A CNN is typically used for image classification, for example
identifying whether an image shows a bird, a plane, or Superman.
Convolutional Neural Network (CNN) layers are used for feature
extraction on the input data.
The CNN model uses a sliding convolutional filter (kernel). This filter
detects the features of the given data and generates a feature map. An
advantage of the CNN model is that it is location invariant, i.e., it can
detect a feature even if it appears at a different location in the input.
The network scans images from left to right and top to bottom to pull out
important features from the image and combines those features to
classify the image.
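As a small illustration (not part of the project code), the snippet below builds a single convolutional layer in Keras; the 28x28 grayscale input and the choice of eight 3x3 filters are assumptions for the example.

from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model

inputs = Input(shape=(28, 28, 1))  # a 28x28 grayscale image
# Eight 3x3 filters slide left to right and top to bottom over the input,
# each producing its own feature map.
feature_maps = Conv2D(filters=8, kernel_size=(3, 3), activation='relu')(inputs)
model = Model(inputs, feature_maps)
print(model.output_shape)  # (None, 26, 26, 8): one 26x26 feature map per filter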
1.3.1 ARCHITECTURE OF CNN
1.5 ARCHITECTURE OF LSTM
1.6 BACKEND
For the image caption generator, we will be using the Flickr8K dataset
together with its text files. Flickr8K contains 8091 images, and the
text files contain the captions for those images.
CHAPTER 2
LITERATURE SURVEY
One advantage of CNNs is that they are easier to train and have many
fewer parameters than fully connected networks with the same number of
hidden units. CNNs have been widely used and studied for image
tasks, and are currently state of the art for object recognition and
detection.
Our approach is to infer these alignments and use them to learn a generative
model of descriptions. We develop a deep neural network model that infers the
alignment between segments of sentences and the regions of the image that they
describe. We introduce a Recurrent Neural Network architecture that takes
an input image and generates its description in text. Our experiments show that
the generated sentences produce sensible qualitative predictions.
CHAPTER 3
REQUIREMENT ANALYSIS
3.1 HARDWARE REQUIREMENTS
3.1.1 GPU
3.1.2 TPU
3.2 SOFTWARE REQUIREMENTS
3.2.1 TENSORFLOW
3.2.2 KERAS
CHAPTER 4
4.4 UML DIAGRAMS
4.4.2 ACTIVITY DIAGRAM
4.4.3 ARCHITECTURE DIAGRAM FOR IMAGE CAPTIONING
CHAPTER 5
SYSTEM IMPLEMENTATION
The first helper function loads the document file and reads its
contents into a string.
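A minimal sketch of such a loading function is shown below; the helper name load_doc is an assumption based on the description above.

def load_doc(filename):
    # Open the file read-only and return its entire contents as one string.
    with open(filename, 'r') as file:
        text = file.read()
    return text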
5.2.3 CLEANING TEXT
This function takes all the descriptions and performs data cleaning. This
is an important step when working with textual data: depending on our
goal, we decide what type of cleaning we want to perform on the text.
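The report does not list the exact cleaning steps, so the sketch below shows typical cleaning for this task (lowercasing, removing punctuation, and dropping short or non-alphabetic tokens), assuming descriptions maps image names to lists of captions.

import string

def clean_descriptions(descriptions):
    # Translation table that deletes all punctuation characters.
    table = str.maketrans('', '', string.punctuation)
    for key, caption_list in descriptions.items():
        for i, caption in enumerate(caption_list):
            words = caption.split()
            words = [w.lower().translate(table) for w in words]
            # Drop single letters and tokens containing digits.
            words = [w for w in words if len(w) > 1 and w.isalpha()]
            caption_list[i] = ' '.join(words)
    return descriptions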
This is a simple function that separates out all the unique words and
creates the vocabulary from all the descriptions.
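A minimal sketch of this vocabulary builder, under the same assumed descriptions structure:

def text_vocabulary(descriptions):
    # Collect every unique word appearing in any caption.
    vocab = set()
    for key in descriptions:
        for caption in descriptions[key]:
            vocab.update(caption.split())
    return vocab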
This function will create a list of all the descriptions that have been
pre-processed and store them in a file.
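A minimal sketch of the saving step; the report truncates before naming the output file, so the filename is left as a parameter.

def save_descriptions(descriptions, filename):
    # One line per caption: image name, a tab, then the cleaned caption.
    lines = []
    for key, caption_list in descriptions.items():
        for caption in caption_list:
            lines.append(key + '\t' + caption)
    with open(filename, 'w') as file:
        file.write('\n'.join(lines))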
For feature extraction we will use the Xception model, which was
pre-trained on the ImageNet dataset of 1000 classes. We can directly
import this model from keras.applications. Make sure you are connected
to the internet, as the weights get downloaded automatically. Since the
Xception model was originally built for ImageNet, we make small changes
to integrate it with our model. One thing to notice is that the Xception
model takes a 299*299*3 image size as input. We will remove the last
classification layer and get the 2048-dimensional feature vector:
model = Xception(include_top=False, pooling='avg'). The function
extract_features() will extract features for all images, and we will map
image names to their respective feature arrays. Then we will dump the
features dictionary into a "features.p" pickle file.
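A sketch of extract_features() consistent with the description above; the image directory name in the usage comment is hypothetical.

import os
import pickle
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def extract_features(directory):
    # Xception without its classification head; global average pooling
    # yields one 2048-dimensional feature vector per image.
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for name in os.listdir(directory):
        img = load_img(os.path.join(directory, name), target_size=(299, 299))
        img = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
        features[name] = model.predict(img, verbose=0)
    return features

# features = extract_features('Flickr8k_Dataset')  # hypothetical directory name
# pickle.dump(features, open('features.p', 'wb'))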
• Load photos – this will load the text file into a string and return
the list of image names; a sketch follows below.
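A minimal sketch of this step, reusing the load_doc helper sketched earlier; the exact file format is assumed to be one image name per line.

def load_photos(filename):
    file = load_doc(filename)
    # One image filename per line; skip any blank lines.
    photos = [line for line in file.split('\n') if line.strip()]
    return photos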
Here we need to train our model on 6000 images; each image is
represented by a 2048-length feature vector, and each caption is also
encoded as numbers. The data for 6000 images cannot be held in memory
all at once, so we will use a generator method that yields batches.
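A hedged sketch of such a generator, assuming descriptions, features, tokenizer, max_length, and vocab_size come from the earlier steps; for simplicity it yields one training sample (image feature plus partial word sequence, with the next word as the target) at a time.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, features, tokenizer, max_length, vocab_size):
    while True:  # Keras generators loop forever; steps_per_epoch bounds an epoch
        for key, caption_list in descriptions.items():
            feature = features[key][0]
            for caption in caption_list:
                seq = tokenizer.texts_to_sequences([caption])[0]
                # Every prefix of the caption predicts the word that follows it.
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    yield [np.array([feature]), np.array([in_seq])], np.array([out_word])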
CHAPTER 6
PERFORMANCE EVALUATION
The model has been trained; now we will make a separate file,
testing_caption_generator.py, which will load the model and generate
predictions. The predictions contain index values up to the maximum
caption length, so we will use the same tokenizer.p pickle file to get
the words back from their index values. We encode all the test images
and save them in the file "encoded_test_images.pkl".
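A minimal sketch of the prediction loop, assuming the 'start' and 'end' marker words used when the captions were cleaned, plus the trained model, tokenizer, and max_length from the earlier steps.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_for_id(integer, tokenizer):
    # Reverse lookup: map a predicted index back to its word.
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    # Greedy decoding: feed the words generated so far back into the model.
    in_text = 'start'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo, seq], verbose=0)
        word = word_for_id(np.argmax(yhat), tokenizer)
        if word is None or word == 'end':
            break
        in_text += ' ' + word
    return in_text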
6.2 RESULT
The project was carried out as described, and the model successfully
generated descriptions for the input images.
CHAPTER 7
7.1 CONCLUSION
APPENDIX 1
SOURCE CODE
APPENDIX 2
SCREENSHOT
REFERENCES