
Image Captions With Deep Learning

Yulia Kogan & Ron Shiff


Lecture outline
Part 1 – NLP and RNN Introduction

• “The Unreasonable Effectiveness of Recurrent Neural Networks”


• Basic Recurrent Neural Network NLP example
• Long Short-Term Memory (LSTM) RNNs

Part 2 – Image Captioning Algorithms using RNNs


The Unreasonable Effectiveness of Recurrent Neural Networks

• Taken from Andrej Karpathy’s blog


• So far – "old school" neural networks: fixed-length inputs and outputs
• RNNs operate over sequences of vectors (as input and/or output)
• Examples: Image Captions, Sentiment Analysis, Machine Translation, "Word Prediction"
The Unreasonable Effectiveness of Recurrent Neural Networks
• Algebraic geometry – a generated LaTeX sample
The Unreasonable Effectiveness of Recurrent Neural Networks
• Shakespeare – a generated text sample
Word Vectors
• Classical word representation is "one hot":
  each word is represented by a sparse vector y ∈ ℝ^{|V|}
Word Vectors
• A more modern approach: represent each word by a dense vector x ∈ ℝ^d  (d ≪ |V|)
• "Semantically" close words are close in the vector space
• Semantic relations are preserved in the vector space:
  "king" + "woman" − "man" = "queen"
Word Vectors
• A word vector can be written as x = W y, where y is a "one hot" vector and W ∈ ℝ^{d×|V|}
• Beneficial for most deep learning tasks (a small lookup sketch follows below)
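A minimal numpy sketch of the x = W y lookup. The vocabulary, the dimension d, and the random W below are illustrative assumptions, not values from the lecture:

import numpy as np

# Illustrative sizes: |V| = 5 words, d = 3 embedding dimensions
vocab = ["king", "queen", "man", "woman", "cat"]
V, d = len(vocab), 3

rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))            # embedding matrix, W in R^(d x |V|)

# "one hot" vector y for the word "king"
y = np.zeros(V)
y[vocab.index("king")] = 1.0

# Dense word vector x = W y; in practice this is just a column lookup
x = W @ y
assert np.allclose(x, W[:, vocab.index("king")])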
RNN – Language Model
(Based on Richard Socher's lecture – Deep Learning for NLP, Stanford)
A language model computes a probability for a sequence of words:
  P(w_1, ..., w_T)

Examples:
• Word ordering:  P(the cat is small) > P(small is the cat)
• Word choice:  P(I am going home) > P(I am going house)
Recurrent Neural Networks Language Model
• Each output depends on all previous inputs
RNN – Language Model
• Input: word vectors x_1, ..., x_t, ..., x_T
• At each time step, compute:
  h_t = σ(W^{hh} h_{t−1} + W^{hx} x_t)
  ŷ_t = softmax(W^{s} h_t)
• Output: ŷ_{t,j} = P̂(x_{t+1} = v_j | x_t, ..., x_1)
• Dimensions:
  x_t ∈ ℝ^d,  h_t ∈ ℝ^{D_h},  ŷ_t ∈ ℝ^{|V|}
  W^{hx} ∈ ℝ^{D_h×d},  W^{hh} ∈ ℝ^{D_h×D_h},  W^{s} ∈ ℝ^{|V|×D_h}
  (a toy implementation sketch follows below)
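A toy numpy sketch of one forward step of the model above. The sizes d, D_h, |V|, the random weights, and the dummy input sequence are assumptions for illustration only:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, Dh, V = 4, 8, 10                        # x_t in R^d, h_t in R^Dh, y_hat in R^|V|
rng = np.random.default_rng(1)
Whx = rng.normal(scale=0.1, size=(Dh, d))
Whh = rng.normal(scale=0.1, size=(Dh, Dh))
Ws  = rng.normal(scale=0.1, size=(V, Dh))

def rnn_step(x_t, h_prev):
    """h_t = sigma(Whh h_{t-1} + Whx x_t);  y_hat_t = softmax(Ws h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(Whh @ h_prev + Whx @ x_t)))   # logistic sigma
    y_hat = softmax(Ws @ h_t)              # distribution over the next word
    return h_t, y_hat

h = np.zeros(Dh)
for x_t in rng.normal(size=(5, d)):        # a dummy sequence of 5 word vectors
    h, y_hat = rnn_step(x_t, h)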
Recurrent Neural Networks – Language Model
• Total objective: maximize the log-likelihood w.r.t. the parameters θ:
  J^{ML}(θ) = Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log(ŷ_{t,j})
  where y_t is the "one hot" vector containing the true word, and
  ŷ_{t,j} = P̂(x_{t+1} = v_j | x_t, ..., x_1)
  [Figure: plot of −log(y) for y between 0 and 1.5]
• log-likelihood:
  log P(w_1, ..., w_T) = log ∏_{t=1}^{T} P(w_t | w_{t−1}, ..., w_1) = Σ_{t=1}^{T} log P(w_t | w_{t−1}, ..., w_1)
RNNs – HARD TO TRAIN!
Vanishing/Exploding gradient problem
• For stochastic gradient descent we calculate the derivative of the loss w.r.t. the parameters.
• Reminder: h_t = σ(z_t), where z_t = W h_{t−1} + W^{hx} x_t (writing W for W^{hh})
• Applying the chain rule:
  ∂J_t/∂W = Σ_{k=1}^{t} (∂J_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
Vanishing/Exploding gradient problem
• Update equation: h_t = σ(z_t),  z_t = W h_{t−1} + W^{hx} x_t
• By the chain rule:
  ∂J_t/∂W = Σ_{k=1}^{t} (∂J_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
• The factor ∂h_t/∂h_k is a product of Jacobians:
  ∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i−1} = ∏_{i=k+1}^{t} diag(σ'(z_i)) W
  ‖∂h_t/∂h_k‖ = ‖∏_{i=k+1}^{t} ∂h_i/∂h_{i−1}‖ ≤ ∏_{i=k+1}^{t} ‖diag(σ'(z_i))‖ ‖W‖ ∝ ‖W‖^{t−k}
Vanishing/Exploding gradient problem
• Gradients can therefore be very large or very small:
  ‖∂h_t/∂h_k‖ = ‖∏_{i=k+1}^{t} ∂h_i/∂h_{i−1}‖ ≤ ∏_{i=k+1}^{t} ‖diag(σ'(z_i))‖ ‖W‖ ∝ ‖W‖^{t−k}
• "Small W" – vanishing gradient: long time dependencies are lost
• "Large W" – exploding gradient (bad for optimization)
  (a small numerical sketch follows below)
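A small numerical illustration of the ‖W‖^(t−k) factor in the bound above. The orthogonal W, the scales 0.9 and 1.1, and the 50 steps are assumptions chosen for the demo; the diag(σ'(z_i)) factors are ignored here, and since they are bounded they only push the product further toward vanishing:

import numpy as np

def repeated_product_norm(w_scale, steps=50, Dh=8, seed=2):
    """Spectral norm of W^steps - the ||W||^(t-k) factor in the bound."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(Dh, Dh)))   # random orthogonal matrix
    W = w_scale * Q                                  # so ||W|| = w_scale exactly
    J = np.eye(Dh)
    for _ in range(steps):
        J = W @ J                                    # one chain-rule factor per time step
    return np.linalg.norm(J, 2)

print(repeated_product_norm(0.9))   # about 0.005 -> vanishing gradient ("small W")
print(repeated_product_norm(1.1))   # about 117   -> exploding gradient ("large W")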
LSTMs
• Long Short-Term Memory
• Introduced by Hochreiter and Schmidhuber (1997)
• Tackles the vanishing and exploding gradient problems using gating

Taken from Christopher Olah’s blog


LSTM Equations:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
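A hedged numpy sketch of one step implementing the equations above. The sizes, random weights, and dummy inputs are illustrative assumptions; real implementations typically fuse the four gate matrices into a single multiply:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, Dh = 4, 8                                   # input and hidden sizes (assumed)
rng = np.random.default_rng(3)
def init():
    return rng.normal(scale=0.1, size=(Dh, Dh + d)), np.zeros(Dh)
(Wf, bf), (Wi, bi), (Wc, bc), (Wo, bo) = init(), init(), init(), init()

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    C_tilde = np.tanh(Wc @ z + bc)             # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde         # update the memory cell
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

h, C = np.zeros(Dh), np.zeros(Dh)
for x_t in rng.normal(size=(5, d)):            # dummy sequence of 5 inputs
    h, C = lstm_step(x_t, h, C)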
LSTMs
• "Forget gate":
  f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
LSTMs
• "Input gate layer":
  i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
  C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
LSTMs
• Updating the memory cell:
  C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
• The dependence on earlier cell states is no longer an exponential power of W – unrolling gives
  C_t = f_t ⊙ f_{t−1} ⊙ f_{t−2} ⊙ C_{t−3} + ... (gated input terms)
• Information can flow: with f_t = 1,
  C_t = C_{t−1} + i_t ⊙ C̃_t
  (a small check follows below)
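A tiny check of the "information can flow" claim: with the forget gate held at 1 and the input gate at 0, the cell state is copied unchanged across many steps, and ∂C_t/∂C_{t−k} is a product of forget gates rather than a power of W. The numbers below are toy assumptions:

import numpy as np

rng = np.random.default_rng(4)
Dh, T = 4, 100

C = rng.normal(size=Dh)                 # information stored in the cell at t = 0
C0 = C.copy()
for t in range(T):
    f_t = np.ones(Dh)                   # forget gate fully open
    i_t = np.zeros(Dh)                  # nothing new written
    C_tilde = rng.normal(size=Dh)
    C = f_t * C + i_t * C_tilde         # C_t = f_t * C_{t-1} + i_t * C_tilde_t

print(np.allclose(C, C0))               # True: the information survived 100 steps
# dC_t/dC_0 is the elementwise product of the forget gates (here exactly 1):
# no repeated multiplication by W, so nothing forces vanishing or exploding.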
LSTMs
• Finally, setting the output:
  o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
  h_t = o_t ⊙ tanh(C_t)
Conclusions
1. RNNs are very powerful
2. RNNs are hard to train
3. Nowadays, gating (LSTMs) is the way to go!

Acknowledgments:
• Andrej Karpathy – http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Richard Socher – http://cs224d.stanford.edu/
• Christopher Olah – http://colah.github.io/posts/2015-08-Understanding-LSTMs/
