
Image Captioning Approaches

Azam Moosavi
Overview

• Tasks involving sequences: image and video description
• Long-term recurrent convolutional networks for visual recognition and description
• From Captions to Visual Concepts and Back


Long-term recurrent convolutional networks for visual recognition and description

Jeff Donahue et al.
UC Berkeley
LRCN

LRCN is a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs.

[Figure: sequential inputs/outputs. Image credit: main paper]


Activity recognition

[Diagram: each video frame is passed through a CNN and then an LSTM, producing a per-frame activity prediction (e.g., running, sitting, jumping, jumping); the per-frame predictions are averaged to yield the final label (jumping).]
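A minimal PyTorch sketch of this frame-wise CNN-then-LSTM pipeline with time-averaged predictions; the linear "CNN" stand-in, feature size, and class count are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LRCNActivity(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=10):
        super().__init__()
        # Stand-in for the per-frame CNN: in the paper this is a deep
        # convnet; here a single linear layer keeps the sketch small.
        self.cnn = nn.Linear(feat_dim, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                # frames: (batch, T, feat_dim)
        feats = torch.relu(self.cnn(frames))  # per-frame visual features
        hidden, _ = self.lstm(feats)          # hidden state at every time step
        logits = self.classifier(hidden)      # per-frame class scores
        return logits.mean(dim=1)             # average predictions over time

clip = torch.randn(2, 16, 4096)               # 2 clips, 16 frames each
print(LRCNActivity()(clip).shape)             # -> torch.Size([2, 10])
```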
Activity Recognition Evaluation
Image description

[Diagram: the CNN image feature is fed into a single-layer LSTM decoder, which emits the caption word by word: <BOS> a dog is jumping <EOS>.]


Image description

[Diagram: a two-layer variant, in which the CNN image feature feeds a stack of two LSTM layers that decode the caption: <BOS> a dog is jumping <EOS>.]


Image description
Two-layer factored

[Diagram: in the two-layer factored variant, the first LSTM layer processes only the language input; the CNN image feature is injected at the second LSTM layer, which decodes the caption: <BOS> a dog is jumping <EOS>.]
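A hedged sketch of the two-layer factored idea: the bottom LSTM sees only the word embeddings, and the image feature enters at the second layer. All dimensions and names here are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class FactoredCaptioner(nn.Module):
    def __init__(self, vocab=1000, emb=256, img_dim=4096, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # First layer handles language only; the image enters at layer two.
        self.lstm1 = nn.LSTM(emb, hid, batch_first=True)
        self.lstm2 = nn.LSTM(hid + img_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, words, img_feat):        # words: (B, T); img_feat: (B, img_dim)
        h1, _ = self.lstm1(self.embed(words))
        img = img_feat.unsqueeze(1).expand(-1, words.size(1), -1)
        h2, _ = self.lstm2(torch.cat([h1, img], dim=-1))
        return self.out(h2)                    # next-word logits at each step

logits = FactoredCaptioner()(torch.randint(0, 1000, (2, 6)), torch.randn(2, 4096))
print(logits.shape)                            # -> torch.Size([2, 6, 1000])
```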


Image Description Evaluation
Video description

[Diagram: CNN features are extracted from each frame and averaged into a single video feature, which conditions an LSTM that decodes the caption: <BOS> a dog is jumping <EOS>.]

Video description

[Diagram: encoder-decoder variant. CNN features for N time steps of video frames are read by an LSTM encoder; an LSTM decoder then produces the description over M time steps: <BOS> a dog is jumping <EOS>.]
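A rough sketch of this encoder-decoder setup, assuming placeholder sizes, a made-up <BOS> token id, and greedy decoding: an LSTM reads the N frame features, and its final state seeds an LSTM cell that emits M words.

```python
import torch
import torch.nn as nn

enc = nn.LSTM(4096, 256, batch_first=True)    # reads N frame features
dec = nn.LSTMCell(256, 256)                   # emits M caption words
embed = nn.Embedding(1000, 256)
to_vocab = nn.Linear(256, 1000)

frames = torch.randn(1, 16, 4096)             # N = 16 CNN frame features
_, (h, c) = enc(frames)                       # video summarized in (h, c)
h, c = h.squeeze(0), c.squeeze(0)
word = torch.tensor([1])                      # assumed <BOS> token id
for _ in range(8):                            # M = 8 decoding steps
    h, c = dec(embed(word), (h, c))
    word = to_vocab(h).argmax(dim=-1)         # greedy choice of next word
```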
Video description

[Diagram: a third variant replaces the CNN features with pre-trained detector predictions, which are fed into the two-layer LSTM decoder to produce: <BOS> a dog is jumping <EOS>.]

Video Description

Figure credit: main paper


Video Description Evaluation

Evaluated on the TACoS Multi-Level dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation).
From Captions to Visual Concepts and Back

Saurabh Gupta
UC Berkeley
Work done at Microsoft Research

Hao Cheng, Li Deng, Jacob Devlin, Piotr Dollár, Hao Fang, Jianfeng Gao, Xiaodong He, Forrest Iandola, Margaret Mitchell, John C. Platt, Rupesh Srivastava, C. Lawrence Zitnick, Geoffrey Zweig
[Pipeline overview: an image of a woman holding a camera in a crowd, with detected words overlaid (woman, crowd, holding, camera, cat, purple).]

1. Word Detection: woman, crowd, cat, camera, holding, purple, ...
2. Sentence Generation: "A purple camera with a woman." / "A woman holding a camera in a crowd." / "A woman holding a cat."
3. Sentence Re-Ranking: #1 "A woman holding a camera in a crowd."

Slide credit: Saurabh Gupta
Caption generation: visual detectors

Word detection: detect a set of words that are likely to be part of the image caption. The image is passed through a CNN with FC6, FC7, and FC8 treated as fully convolutional layers, producing spatial class probability maps; Multiple Instance Learning (MIL) turns these into a per-class probability. The probability that region j of image i expresses word w is

$$p_{ij}^{w} = \frac{1}{1 + \exp\left(-\left(\mathbf{w}_w^{\top}\,\phi(\mathrm{FC7}_{ij}) + b_w\right)\right)}$$

Slide credit: Saurabh Gupta
Multiple Instance Learning (MIL)

• In MIL, instead of giving the learner labels for individual examples, the trainer labels only collections of examples, called bags.
Multi-instance learning sits between the two classical paradigms:

• Unsupervised (k-means, PCA, mixture models, ...): recovers "latent" structure, but not a classifier; no labels required; one can only hope the structure is useful.
• Supervised (neural nets, perceptrons, SVMs, ...): trains a classifier; success = low test error; requires labels.
• In between: transductive inference, co-training, and multi-instance learning.
Multiple Instance Learning (MIL)

A bag is labeled positive if it contains at least one positive example, and negative if all of its examples are negative, as the two-line rule below shows.

[Diagram: negative bags (Bi-) and positive bags (Bi+).]
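```python
# Tiny illustration of the MIL bag-labeling rule above: a bag is positive
# iff at least one of its instances is positive.
def bag_label(instance_labels):
    return any(instance_labels)

print(bag_label([False, True, False]))   # positive bag -> True
print(bag_label([False, False, False]))  # negative bag -> False
```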


Multiple Instance Learning (MIL)

We want to predict an image's target class from its visual content. For instance, the target class might be "beach", where the image contains both "sand" and "water".

In MIL terms, the image is described as a bag $X = \{x_1, x_2, \ldots, x_N\}$, where each $x_i$ is the feature vector (called an instance) extracted from the i-th region of the image, and N is the total number of regions (instances) partitioning the image.

The bag is labeled positive ("beach") if it contains both "sand" region instances and "water" region instances.

Image credit: "Multiple instance classification: review, taxonomy and comparative study", Jaume Amores (2013)
Noisy-OR for Estimating the Density

• It is assumed that the event can only happen if at least one of its causes occurred.
• It is also assumed that the probability of any cause failing to trigger the event is independent of any other cause.
Caption generation: visual detectors

Word detection: detect a set of words that are likely to be part of the image caption. The noisy-OR version of MIL combines the per-region probabilities into the probability that word w appears in image i:

$$p_i^{w} = 1 - \prod_{j \in b_i}\left(1 - p_{ij}^{w}\right), \qquad p_{ij}^{w} = \frac{1}{1 + \exp\left(-\left(\mathbf{w}_w^{\top}\,\phi(\mathrm{FC7}_{ij}) + b_w\right)\right)}$$

where $b_i$ is the set of regions (the bag) for image i.

Slide credit: Saurabh Gupta
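A minimal NumPy sketch of this noisy-OR word detector; the weights, bias, and random "FC7" region features are stand-ins for the learned quantities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(fc7_regions, w_w, b_w):
    """fc7_regions: (num_regions, d) features; w_w, b_w: parameters for word w."""
    p_region = sigmoid(fc7_regions @ w_w + b_w)   # p_ij^w for each region j
    return 1.0 - np.prod(1.0 - p_region)          # noisy-OR over the bag b_i

rng = np.random.default_rng(0)
regions = rng.normal(size=(12, 4096))             # 12 regions, fc7-sized features
print(word_probability(regions, 0.01 * rng.normal(size=4096), -2.0))
```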


Visual detectors

[Figure: example word-detector outputs. Slide credit: Saurabh Gupta]


Pipeline recap, step 2 (Sentence Generation): from the detected words (woman, crowd, cat, camera, holding, purple, ...), generate candidate sentences such as "A purple camera with a woman.", "A woman holding a camera in a crowd.", and "A woman holding a cat."

Slide credit: Saurabh Gupta
Probabilistic language modeling

• Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5, ..., wn)
• Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)
• A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model.
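A toy illustration of the chain-rule decomposition this definition implies; `p_next` is a hypothetical conditional model, here a uniform stand-in.

```python
import math

def sentence_log_prob(words, p_next):
    """Chain rule: log P(W) = sum_t log P(w_t | w_1 .. w_{t-1})."""
    return sum(math.log(p_next(words[:t], words[t])) for t in range(len(words)))

# A uniform stand-in model over a 1000-word vocabulary:
uniform = lambda history, word: 1.0 / 1000
print(sentence_log_prob("a woman holding a camera".split(), uniform))
```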
Probabilistic language modeling

[Animation: the caption is generated left to right. Starting from "A woman", the model repeatedly picks the next word from the pool of detected words (holding, cat, purple, camera, crowd); each detected word leaves the pool once used, until the full caption emerges: "A woman holding a camera in a crowd."]

Slide credit: Saurabh Gupta
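A sketch of the procedure these slides animate, with a toy stand-in scorer in place of the real language model; the filler-word set and token names are assumptions.

```python
def generate(detected, score, max_len=12):
    pool, caption = set(detected), ["<BOS>"]
    while caption[-1] != "<EOS>" and len(caption) < max_len:
        candidates = pool | {"a", "in", "<EOS>"}      # detections + filler words
        best = max(candidates, key=lambda w: score(caption, w))
        pool.discard(best)                            # a used detection leaves the pool
        caption.append(best)
    return " ".join(caption[1:-1])

# Toy scorer that simply prefers a scripted word order, for demonstration:
target = ["a", "woman", "holding", "a", "camera", "in", "a", "crowd", "<EOS>"]
score = lambda hist, w: 1.0 if len(hist) - 1 < len(target) and target[len(hist) - 1] == w else 0.0
print(generate({"woman", "crowd", "camera", "holding", "cat", "purple"}, score))
# -> a woman holding a camera in a crowd
```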
Maximum entropy language model

• The ME LM estimates the probability of a word conditioned on the preceding words and on the set of detected words that have not yet been mentioned in the sentence.
• To train the ME LM, the objective function is the log-likelihood of the captions conditioned on the corresponding sets of detected objects.
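A hedged sketch of a log-linear (maximum-entropy) next-word distribution over hand-made features of the history and the still-unmentioned detections; the feature set and weights are illustrative, not the paper's.

```python
import math

def me_lm_prob(word, history, unused, weights, vocab):
    """Log-linear (maximum-entropy) next-word model with toy features."""
    def score(w):
        f = {
            ("prev", history[-1], w): 1.0,                  # bigram-style feature
            ("unused-detection", w): 1.0 if w in unused else 0.0,
        }
        return sum(weights.get(k, 0.0) * v for k, v in f.items())
    z = sum(math.exp(score(w)) for w in vocab)              # softmax normalizer
    return math.exp(score(word)) / z

vocab = ["woman", "camera", "cat", "a", "<EOS>"]
weights = {("unused-detection", "camera"): 2.0}             # favors unmentioned detections
print(me_lm_prob("camera", ["a", "woman", "holding", "a"], {"camera", "crowd"}, weights, vocab))
```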
Language model

Slide credit: Saurabh Gupta
Pipeline recap, step 3 (Sentence Re-Ranking): the candidate sentences are re-ranked, and "A woman holding a camera in a crowd." is ranked #1.

Slide credit: Saurabh Gupta


Re-rank hypotheses globally

Deep Multimodal Similarity Model (DMSM): the image is mapped to a vector y_Q and the text to a vector y_D. We measure similarity between images and text as the cosine similarity between their corresponding vectors,

$$R(Q, D) = \operatorname{cosine}(y_Q, y_D)$$

and this similarity score is used by MERT to re-rank the sentences (e.g., "A woman holding a camera in a crowd.").

We can compute the posterior probability of the text being relevant to the image via a softmax over the candidate captions:

$$P(D \mid Q) = \frac{\exp\left(\gamma\, R(Q, D)\right)}{\sum_{D'} \exp\left(\gamma\, R(Q, D')\right)}$$

where $\gamma$ is a smoothing factor.

DMSM embedding: trained to maximize the similarity between an image and its corresponding caption.

Slide credit: Saurabh Gupta
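A small NumPy sketch of the cosine relevance and softmax posterior above; the vector size and the value of gamma are arbitrary choices for illustration.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def caption_posterior(y_q, caption_vecs, gamma=10.0):
    """Softmax over candidate captions of gamma-scaled cosine relevance."""
    r = np.array([cosine(y_q, y_d) for y_d in caption_vecs])
    e = np.exp(gamma * (r - r.max()))         # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(1)
y_q = rng.normal(size=128)                              # image vector y_Q
candidates = [rng.normal(size=128) for _ in range(3)]   # text vectors y_D
print(caption_posterior(y_q, candidates))               # sums to 1 over candidates
```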
Results

[Table: caption generation performance for seven variants of the system on the Microsoft COCO dataset.]

Slide credit: Saurabh Gupta


Thank You
