IMAGE CAPTIONING

- A DEEP LEARNING APPROACH

Presented By:
Pornima Nikam (18141101)
Priyanka Gundesha (18141106)
Pallavi Bharti (18141112)

Guided By:
Prof. C. P. Garware

Introduction
Image captioning is the process of generating a textual
description of an image. It uses both Natural
Language Processing and Computer Vision to
generate the captions. It is a multimodal task in which
we combine image and text processing to
build a useful deep learning application.
Motivation
We must first understand how important image captioning
is in real-world scenarios. Let us look at a few
applications where this model can be a solution:

01 Image Search Tool
02 Guidance Device
03 Self Driving Cars
04 Web Development
Technologies Used
Platform, Tools and Dataset

01 Kaggle Platform
02 Keras with TensorFlow as backend
03 Pre-trained ResNet50 model
04 Flickr_8K Dataset
05 CNN
06 LSTM, RNN
CNN
Convolutional Neural Network

A Convolutional Neural Network (CNN) is a Deep Learning
algorithm which can take in an input image, assign
importance to various aspects/objects in the image and
be able to differentiate one from the other.
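
The core operation behind a CNN can be sketched in a few lines. This is only an illustration in plain Python, not the code used in this project: the function name `conv2d`, the toy image and the hand-picked edge kernel are all assumptions made for the example. In a real CNN the kernel values are learned during training.

```python
def conv2d(image, kernel):
    """Slide the kernel over a grayscale image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image (assumed data): dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
print(feature_map)
```

The middle column of `feature_map` is the only non-zero one, which is where the edge sits in the image. This is the sense in which a filter "assigns importance" to one aspect of the input.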
Recurrent Neural Network (RNN)
RNNs are a type of neural network in which the
output from the previous step is fed as input to the
current step. The main feature of an RNN is its hidden
state, which remembers some information about a
sequence.
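
The role of the hidden state can be sketched with a scalar RNN. This is a minimal illustration, not the model from this project: the names `rnn_step` and `run_rnn` and the weight values are assumptions chosen for the example.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """Compute the new hidden state from the current input and the previous state."""
    return math.tanh(w_xh * x + w_hh * h_prev + b_h)


def run_rnn(inputs, w_xh=0.5, w_hh=0.8, b_h=0.0):
    """Run the step over a sequence, feeding each hidden state into the next step."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# Only the first input is non-zero (assumed toy sequence).
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

Although the second and third inputs are zero, the hidden state stays positive and fades gradually. The network still carries information about the first element of the sequence.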
ResNet50
Residual Network

ResNet is short for Residual Network, a network that
supports residual learning. The 50 indicates the
number of layers that it has. In residual learning,
instead of trying to learn some features directly, a
block of layers tries to learn a residual. The residual
can be understood as the feature the block should learn
minus the input to that block, so the output of the block
is its input plus the learned residual.
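
The idea of adding the input back to a learned residual can be sketched as follows. This is a plain-Python illustration, not ResNet50 itself: `residual_block`, `zero_residual` and `small_correction` are hypothetical names, and the two residual functions stand in for layers that would normally be learned.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the learned layers F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Stand-in for layers that have learned nothing: the residual is zero."""
    return [0.0 for _ in x]


def small_correction(x):
    """Stand-in for layers that have learned a small adjustment to the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)
corrected_out = residual_block(x, small_correction)

# The residual is recovered as output minus input.
residual = [o - xi for o, xi in zip(corrected_out, x)]
print(identity_out, corrected_out, residual)
```

With a zero residual the block simply passes its input through, which is one reason deep residual networks are easier to train: a block only has to learn how its output should differ from its input.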
Flickr_8K
It contains 8000 images, most of them featuring
people and animals in a state of action. Each image is
provided with five different captions.

LSTM
For generating the captions, we make use of Long
Short-Term Memory (LSTM) networks. LSTMs are a
variant of Recurrent Neural Networks which are widely
used in Natural Language Processing.
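
How an LSTM differs from a plain RNN can be sketched with a single scalar cell. This is a minimal illustration under assumed values, not the captioning model from this project: `lstm_step`, the gate names in `params` and all the weights are hypothetical. The cell keeps a separate cell state `c`, and three gates decide what to forget, what to write and what to expose.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. params maps gate name -> (w_x, w_h, b)."""

    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate new information

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Assumed toy weights, identical for every gate.
gate_names = ("forget", "input", "output", "candidate")
params = {name: (0.5, 0.5, 0.0) for name in gate_names}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

In a captioning model the inputs would be word embeddings rather than single numbers, and the weights would be learned from the image-caption pairs.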
Implementation
Output
Accuracy and Predictions
[Figure: two plots of accuracy against epoch, one covering epochs 2 to 7 and one covering epochs 46 to 50.]
Future Scope

Monitoring Device
Guidance Tool
Self Driving Cars
Speech Conversion
Web Application
Conclusion

Thus we have implemented a deep learning
approach for the captioning of images. The
Sequential API of Keras was used with
TensorFlow as a backend to implement a
deep learning architecture.
