Image Captioning Approaches
Azam Moosavi
Overview
LRCN (Long-term Recurrent Convolutional Networks)
Donahue, Jeff, et al., UC Berkeley
LRCN is a class of models that is both spatially and temporally deep, and has
the flexibility to be applied to a variety of vision tasks involving sequential
inputs and outputs.
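The LRCN idea (a per-frame CNN feeding a recurrent model, with per-timestep predictions averaged) can be sketched in NumPy. This is a toy, assuming a stand-in linear "CNN", a single vanilla LSTM cell, and made-up dimensions; the original uses a deep CNN and trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_features(frame, W):
    """Stand-in per-frame CNN: just a linear map + ReLU (assumption)."""
    return np.maximum(frame @ W, 0.0)

def lstm_step(x, h, c, Wx, Wh, b):
    """One vanilla LSTM step (gates: input, forget, output; candidate g)."""
    z = x @ Wx + h @ Wh + b
    H = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * H:(k + 1) * H])) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Toy LRCN pass over T frames: CNN -> LSTM -> per-step scores, then averaged.
T, D, F, H, C = 6, 32, 16, 8, 4        # frames, pixels, feat dim, hidden, classes
W_cnn = rng.normal(size=(D, F))
Wx = rng.normal(size=(F, 4 * H))
Wh = rng.normal(size=(H, 4 * H))
b = np.zeros(4 * H)
W_out = rng.normal(size=(H, C))

h, c = np.zeros(H), np.zeros(H)
scores = []
for t in range(T):
    feat = cnn_features(rng.normal(size=D), W_cnn)
    h, c = lstm_step(feat, h, c, Wx, Wh, b)
    scores.append(h @ W_out)           # per-timestep class scores
avg_scores = np.mean(scores, axis=0)   # the "Average" box in the slide
print(avg_scores.shape)                # (4,)
```

Averaging the per-timestep outputs is what makes the same architecture usable for whole-sequence labels such as an activity class.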
[Figure: LRCN with sequential inputs/outputs: activity recognition (per-frame CNN outputs averaged over time to predict an action such as "jumping") and image description.]
Activity Recognition Evaluation
Saurabh Gupta (UC Berkeley), with Hao Cheng, Li Deng, Jacob Devlin, Piotr Dollár, Hao Fang, Jianfeng Gao, Xiaodong He, Forrest Iandola, Margaret Mitchell, John C. Platt, Rupesh Srivastava, C. Lawrence Zitnick, and Geoffrey Zweig
1. Word Detection
[Figure: example image with detected words: woman, crowd, holding, camera, cat, purple.]
Slide credit: Saurabh Gupta
CNN + MIL word detection: the CNN's FC6, FC7, and FC8 layers are recast as fully convolutional layers, producing spatial per-class probability maps over image regions; Multiple Instance Learning turns these into per-class probabilities for the whole image.

p_ij^w = 1 / (1 + exp(-(W_w^T φ_ij(FC7) + b_w)))
Multiple Instance Learning (MIL)
• In MIL, instead of giving the learner labels for the individual
examples, the trainer only labels collections of examples, which are
called bags.
[Diagram: multi-instance learning positioned between unsupervised and supervised learning, since labels are given only at the bag level.]
In MIL terms, the image is described as a bag X = {x_1, x_2, ..., x_N}, where each x_i is the feature vector (called an instance) extracted from the corresponding i-th region of the image and N is the total number of regions (instances) partitioning the image.
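The bag construction above can be sketched directly. This is illustrative only: the region count, feature dimension, and per-region "ground truth" are made up, and the actual instances come from the CNN's FC7 layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# A bag for one image: N region feature vectors (instances).
N, D = 10, 4096                                 # assumed sizes
bag = [rng.normal(size=D) for _ in range(N)]    # X = {x_1, ..., x_N}

# Bag-level label for a word w: 1 if ANY region shows w (the trainer never
# labels individual instances, only the bag).
instance_truth = rng.random(N) > 0.8            # hypothetical per-region truth
bag_label = int(instance_truth.any())
print(len(bag), bag_label)
```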
Per-region word probabilities are combined into an image-level probability with a noisy-OR over the regions j in bag b_i:

p_i^w = 1 - ∏_{j ∈ b_i} (1 - p_ij^w),   where p_ij^w = 1 / (1 + exp(-(W_w^T φ_ij(FC7) + b_w)))
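The noisy-OR aggregation can be sketched in a few lines. Weights, bias, and region features are random placeholders here; in the real pipeline W_w and b_w are learned per word and the features are FC7 responses at each spatial location.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(fc7_regions, W_w, b_w):
    """Noisy-OR MIL: image-level probability that word w appears,
    given per-region FC7 features (one row per region)."""
    p_region = sigmoid(fc7_regions @ W_w + b_w)   # p_ij^w for each region j
    return 1.0 - np.prod(1.0 - p_region)          # p_i^w (noisy-OR over the bag)

rng = np.random.default_rng(0)
regions = rng.normal(size=(12, 4096))             # 12 regions, FC7-sized feats
W_w, b_w = rng.normal(size=4096) * 0.01, 0.0      # hypothetical word classifier
p = word_probability(regions, W_w, b_w)
print(round(float(p), 3))
```

The noisy-OR makes the image-level probability high as soon as any single region fires for the word, matching the bag-labeling semantics of MIL.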
[Figure: the detected words (woman, holding, camera, cat, purple, crowd) are assembled into the caption "A woman holding a camera in a crowd."]
Maximum entropy language model
Conditioned on the detected words (woman, holding, camera, crowd), the maximum entropy language model generates candidate captions such as "A woman holding a camera in a crowd."
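A maximum entropy next-word model scores each word with exp(λ·f) over hand-crafted features and normalizes. The sketch below uses two invented features and hand-set weights (the real system learns many features over the detected-word set and the history), so it only illustrates the shape of the model.

```python
import math

# Words detected in the image that the caption should still "cover".
detected = {"woman", "holding", "camera", "crowd"}
vocab = ["a", "woman", "holding", "camera", "in", "crowd", "."]

def features(word, history, remaining):
    """Two toy binary features (assumption, not the paper's feature set)."""
    return {
        "in_remaining": 1.0 if word in remaining else 0.0,   # favour unused detections
        "repeats_last": 1.0 if history and word == history[-1] else 0.0,
    }

weights = {"in_remaining": 2.0, "repeats_last": -5.0}        # hand-set λ

def next_word_dist(history, remaining):
    scores = {
        w: math.exp(sum(weights[k] * v
                        for k, v in features(w, history, remaining).items()))
        for w in vocab
    }
    z = sum(scores.values())                                 # partition function
    return {w: s / z for w, s in scores.items()}

dist = next_word_dist(["a"], detected)
print(max(dist, key=dist.get) in detected)                   # True
```

Decoding then beam-searches over such conditional distributions, removing each detected word from the remaining set once it has been emitted.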
DMSM embedding: the Deep Multimodal Similarity Model embeds the image and the candidate captions into a common space, trained to maximize the similarity between an image and its corresponding caption, and is used to rank the candidates.
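The similarity scoring can be sketched with one-layer embeddings and cosine similarity. All dimensions and weights below are placeholders (the real DMSM stacks several trained layers on each side); only the structure, two towers mapped into one space and compared by cosine, follows the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """One-layer embedding into the shared space, L2-normalized so a dot
    product is a cosine similarity (a sketch; DMSM is deeper)."""
    h = np.tanh(x @ W)
    return h / np.linalg.norm(h, axis=-1, keepdims=True)

D_img, D_txt, D_shared = 4096, 300, 64          # assumed dimensions
W_img = rng.normal(size=(D_img, D_shared)) * 0.01
W_txt = rng.normal(size=(D_txt, D_shared)) * 0.01

img = rng.normal(size=D_img)                    # CNN image features
captions = rng.normal(size=(5, D_txt))          # 5 candidate caption vectors

sims = embed(captions, W_txt) @ embed(img, W_img)   # cosine similarity per caption
best = int(np.argmax(sims))                     # re-rank: keep the most similar
print(best, sims.shape)
```

Training pushes the matching image/caption pair's cosine similarity above those of mismatched pairs, so at test time the highest-scoring candidate caption is returned.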
Results
Caption generation performance for seven variants of the system on the Microsoft COCO dataset.