
What is image captioning?

Image captioning is the process of generating accurate and descriptive captions for an image.

Image Captioning Techniques

There are two deep learning models in image captioning for AI: the CNN and the LSTM. In general, researchers use the CNN and the LSTM together as an encoder-decoder architecture:

Input Images -> CNN (Image Understanding Part) -> LSTM (Language Generation Part) -> Generated Captions
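The flow from input image to generated caption can be sketched as a loop that encodes the image once and then produces one word per step. This is only a toy illustration of the data flow: `encode_image`, `decode_step`, the fixed word script, and the 0.5 brightness threshold are all hypothetical stand-ins for a trained CNN and LSTM.

```python
# Toy sketch of the encoder-decoder data flow. Every function and value
# here is a hypothetical stand-in, not a trained model.

START, END = "<start>", "<end>"


def encode_image(pixels):
    """Stand-in for the CNN encoder: reduce an image to a feature vector.

    A real encoder would output learned convolutional features.
    """
    flat = [p for row in pixels for p in row]
    return [sum(flat) / len(flat), max(flat), min(flat)]


def decode_step(features, previous_words):
    """Stand-in for one LSTM decoder step.

    Chooses the next word from the image features and the words generated
    so far. A real decoder would score every word in the vocabulary.
    """
    adjective = "bright" if features[0] > 0.5 else "dark"
    script = ["a", adjective, "image", END]  # hypothetical fixed output
    return script[len(previous_words) - 1]


def generate_caption(pixels, max_len=10):
    """Encode once, then decode word by word until END or max_len words."""
    features = encode_image(pixels)
    words = [START]
    while len(words) <= max_len:
        next_word = decode_step(features, words)
        if next_word == END:
            break
        words.append(next_word)
    return " ".join(words[1:])


print(generate_caption([[0.9, 0.8], [0.7, 1.0]]))
```

The design point is that the image is encoded a single time, while the decoder is called repeatedly and sees its own previous output at every step.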

CNN (Convolutional Neural Network)

A CNN is used for image classification, computer vision, image recognition, and object detection. A CNN image classifier takes an input image, processes it, and classifies it under certain categories (e.g., Dog, Cat).

LSTM (Long Short-Term Memory)

An LSTM is a type of RNN (recurrent neural network) that is well suited to sequence prediction problems. For example, based on the previous text, we can predict what the next word will be.
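To make the "memory" in an LSTM more concrete, here is a minimal sketch of a single LSTM time step using the standard gate equations. It assumes a scalar input and a scalar hidden state for readability (real LSTMs use vectors and weight matrices), and the names `lstm_step`, `W`, and `b` are chosen here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step for a scalar input and a scalar hidden state.

    W maps each gate name to (weight_on_x, weight_on_h_prev) and b maps
    each gate name to a bias. This layout is an assumption of the sketch.
    """
    def pre(gate):
        wx, wh = W[gate]
        return wx * x + wh * h_prev + b[gate]

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write

    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (step output)
    return h, c


gates = ("input", "forget", "output", "candidate")
W = {gate: (0.0, 0.0) for gate in gates}
b = {gate: 0.0 for gate in gates}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, W=W, b=b)
```

In a caption decoder, `h` would feed a layer that scores the next word, and `(h, c)` would be passed on to the following step.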
What are Image Captioning Datasets?

An image captioning dataset is an assortment of pictures with captions. It is used to train models that can create captions or descriptions for images in the fields of computer vision and natural language processing.

Most Popular Image Captioning Datasets

ImageNet
PASCAL VOC
MSCOCO
UIUC PASCAL
Flickr8k
Evaluation Techniques

Several metrics for automatic evaluation of machine translation

BLEU (Bilingual Evaluation Understudy)
Evaluates how close a machine translation is to a human reference translation.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Compares an automatically produced summary or translation against a set of reference summaries (typically human-produced).
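BLEU and ROUGE both rest on counting overlapping n-grams; BLEU looks at precision and ROUGE at recall. The sketch below shows only those two building blocks, with whitespace tokenization as a simplifying assumption. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, which is left out here, and the function names are chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=1):
    """BLEU-style modified n-gram precision.

    Each candidate n-gram is credited at most as many times as it occurs
    in any single reference.
    """
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams found in the candidate."""
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    cand = ngrams(candidate.split(), n)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())


reference = "the cat is on the mat"
print(clipped_precision("the the the the", [reference]))  # 0.5
print(rouge_n_recall("the the the the", reference))
```

Clipping is what keeps the degenerate caption "the the the the" from scoring a perfect precision: only two of its four words are credited, because "the" appears twice in the reference.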

METEOR (Metric for Evaluation of Translation with Explicit Ordering)
Uses synonym matching to detect similarity between sentences.
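The effect of synonym matching can be illustrated with a simplified METEOR-style score. This sketch aligns words by exact match first and then by synonym, and combines precision and recall with the recall-weighted mean from the original METEOR formulation. The `SYNONYMS` table is a hypothetical stand-in for WordNet, and the stemming stage and the fragmentation penalty of the full metric are deliberately omitted.

```python
# Hypothetical synonym table for illustration only; METEOR itself looks
# synonyms up in WordNet.
SYNONYMS = {
    "dog": {"puppy"},
    "puppy": {"dog"},
    "couch": {"sofa"},
    "sofa": {"couch"},
}


def match_words(candidate, reference):
    """Greedily align candidate words to unused reference words.

    Exact matches are taken first, then synonym matches.
    Returns (number of matches, candidate length, reference length).
    """
    cand, ref = candidate.split(), reference.split()
    ref_used = [False] * len(ref)
    cand_used = [False] * len(cand)
    matches = 0
    stages = (
        lambda c, r: c == r,                        # stage 1: exact match
        lambda c, r: r in SYNONYMS.get(c, set()),   # stage 2: synonym match
    )
    for is_match in stages:
        for ci, c in enumerate(cand):
            if cand_used[ci]:
                continue
            for ri, r in enumerate(ref):
                if not ref_used[ri] and is_match(c, r):
                    ref_used[ri] = cand_used[ci] = True
                    matches += 1
                    break
    return matches, len(cand), len(ref)


def meteor_fmean(candidate, reference):
    """Recall-weighted harmonic mean: 10PR / (R + 9P)."""
    matches, cand_len, ref_len = match_words(candidate, reference)
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 10 * precision * recall / (recall + 9 * precision)


print(meteor_fmean("a puppy on the sofa", "a dog sits on the couch"))
```

With the synonym stage, "puppy" and "sofa" are credited against "dog" and "couch", so all five candidate words match; with exact matching alone only three would.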

SPICE (Semantic Propositional Image Caption Evaluation)
Measures the semantic relationship between the generated caption and the reference captions. It uses a graph-based semantic representation that captures the objects in a caption and the interactions between them.

CIDEr (Consensus-based Image Description Evaluation)
Compares a generated sentence to a set of human-written ground truth sentences for similarity.
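The graph-based comparison in SPICE comes down to an F-score over semantic tuples (objects, attributes, and relations). The sketch below assumes the tuples have already been extracted and writes them by hand; the real metric derives them by parsing the captions into scene graphs, and it also matches synonyms, which this sketch does not. The function name `tuple_f1` and the example tuples are illustrative.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F1 score over two sets of semantic tuples (exact matching only)."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hand-written tuples: (object,), (object, attribute),
# and (subject, relation, object).
reference = {
    ("girl",),
    ("table",),
    ("girl", "young"),
    ("girl", "sit-at", "table"),
}
candidate = {
    ("girl",),
    ("table",),
    ("girl", "stand-on", "table"),
}

score = tuple_f1(candidate, reference)
print(round(score, 3))
```

Here the candidate names the right objects but gets the relation wrong and misses the attribute, so it is credited for two of its three tuples and two of the four reference tuples.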
