
Image Captions With Deep Learning

Yulia Kogan & Ron Shiff


Lecture outline
Part 1 – NLP and RNN Introduction

• “The Unreasonable Effectiveness of Recurrent Neural Networks”


• Basic Recurrent Neural Network NLP example
• Long Short-Term Memory (LSTM) RNNs

Part 2 – Image Captioning Algorithms using RNNs


The Unreasonable Effectiveness of Recurrent Neural Networks

• Taken from Andrej Karpathy’s blog


• So far – "old school" neural networks: fixed-length inputs and outputs
• RNNs operate over sequences of vectors (as input and/or output)
• Examples: Image Captions, Sentiment Analysis, Machine Translation, "Word Prediction"
The Unreasonable Effectiveness of Recurrent Neural Networks
• Algebraic geometry – a generated LaTeX sample
The Unreasonable Effectiveness of Recurrent Neural Networks
• Shakespeare – a generated text sample
Word Vectors
• Classical word representation is "one hot":
  each word is represented by a sparse vector y ∈ ℝ^{|V|}
Word Vectors
• A more modern approach: represent each word by a dense vector x ∈ ℝ^d  (d ≪ |V|)
• "Semantically" close words are close in the vector space
• Semantic relations are preserved in the vector space:
  "king" + "woman" − "man" = "queen"
Word Vectors
• A word vector can be written as x = W y, where y is a "one hot" vector and W ∈ ℝ^{d×|V|}
• Beneficial for most deep learning tasks (a small lookup sketch follows below)
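A minimal numpy sketch of the x = W y lookup. The vocabulary, the dimension d, and the random W below are illustrative assumptions, not values from the lecture:

import numpy as np

# Illustrative sizes: |V| = 5 words, d = 3 embedding dimensions
vocab = ["king", "queen", "man", "woman", "cat"]
V, d = len(vocab), 3

rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))            # embedding matrix, W in R^(d x |V|)

# "one hot" vector y for the word "king"
y = np.zeros(V)
y[vocab.index("king")] = 1.0

# Dense word vector x = W y; in practice this is just a column lookup
x = W @ y
assert np.allclose(x, W[:, vocab.index("king")])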
RNN – Language Model
(Based on Richard Socher's lecture – Deep Learning for NLP, Stanford)
A language model computes a probability for a sequence of words:
  P(w_1, ..., w_T)

Examples:
• Word ordering:  P(the cat is small) > P(small is the cat)
• Word choice:  P(I am going home) > P(I am going house)
Recurrent Neural Networks Language Model
• Each output depends on all previous inputs
RNN – Language Model
• Input: word vectors x_1, ..., x_t, ..., x_T
• At each time step, compute:
  h_t = σ(W^{hh} h_{t−1} + W^{hx} x_t)
  ŷ_t = softmax(W^{s} h_t)
• Output: ŷ_{t,j} = P̂(x_{t+1} = v_j | x_t, ..., x_1)
• Dimensions:
  x_t ∈ ℝ^d,  h_t ∈ ℝ^{D_h},  ŷ_t ∈ ℝ^{|V|}
  W^{hx} ∈ ℝ^{D_h×d},  W^{hh} ∈ ℝ^{D_h×D_h},  W^{s} ∈ ℝ^{|V|×D_h}
  (a toy implementation sketch follows below)
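A toy numpy sketch of one forward step of the model above. The sizes d, D_h, |V|, the random weights, and the dummy input sequence are assumptions for illustration only:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, Dh, V = 4, 8, 10                        # x_t in R^d, h_t in R^Dh, y_hat in R^|V|
rng = np.random.default_rng(1)
Whx = rng.normal(scale=0.1, size=(Dh, d))
Whh = rng.normal(scale=0.1, size=(Dh, Dh))
Ws  = rng.normal(scale=0.1, size=(V, Dh))

def rnn_step(x_t, h_prev):
    """h_t = sigma(Whh h_{t-1} + Whx x_t);  y_hat_t = softmax(Ws h_t)."""
    h_t = 1.0 / (1.0 + np.exp(-(Whh @ h_prev + Whx @ x_t)))   # logistic sigma
    y_hat = softmax(Ws @ h_t)              # distribution over the next word
    return h_t, y_hat

h = np.zeros(Dh)
for x_t in rng.normal(size=(5, d)):        # a dummy sequence of 5 word vectors
    h, y_hat = rnn_step(x_t, h)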
Recurrent Neural Networks – Language Model
• Total objective: maximize the log-likelihood w.r.t. the parameters θ:
  J^{ML}(θ) = Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log(ŷ_{t,j})
  where y_t is the "one hot" vector containing the true word, and
  ŷ_{t,j} = P̂(x_{t+1} = v_j | x_t, ..., x_1)
  [Figure: plot of −log(y) for y between 0 and 1.5]
• log-likelihood:
  log P(w_1, ..., w_T) = log ∏_{t=1}^{T} P(w_t | w_{t−1}, ..., w_1) = Σ_{t=1}^{T} log P(w_t | w_{t−1}, ..., w_1)
RNNs – HARD TO TRAIN!
Vanishing/Exploding gradient problem
• For stochastic gradient descent we calculate the derivative of the loss w.r.t. the parameters.
• Reminder: h_t = σ(z_t), where z_t = W h_{t−1} + W^{hx} x_t (writing W for W^{hh})
• Applying the chain rule:
  ∂J_t/∂W = Σ_{k=1}^{t} (∂J_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
Vanishing/Exploding gradient problem
• Update equation: h_t = σ(z_t),  z_t = W h_{t−1} + W^{hx} x_t
• By the chain rule:
  ∂J_t/∂W = Σ_{k=1}^{t} (∂J_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
• The factor ∂h_t/∂h_k is a product of Jacobians:
  ∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i−1} = ∏_{i=k+1}^{t} diag(σ'(z_i)) W
  ‖∂h_t/∂h_k‖ = ‖∏_{i=k+1}^{t} ∂h_i/∂h_{i−1}‖ ≤ ∏_{i=k+1}^{t} ‖diag(σ'(z_i))‖ ‖W‖ ∝ ‖W‖^{t−k}
Vanishing/Exploding gradient problem
• Gradients can therefore be very large or very small:
  ‖∂h_t/∂h_k‖ = ‖∏_{i=k+1}^{t} ∂h_i/∂h_{i−1}‖ ≤ ∏_{i=k+1}^{t} ‖diag(σ'(z_i))‖ ‖W‖ ∝ ‖W‖^{t−k}
• "Small W" – vanishing gradient: long time dependencies are lost
• "Large W" – exploding gradient (bad for optimization)
  (a small numerical sketch follows below)
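A small numerical illustration of the ‖W‖^(t−k) factor in the bound above. The orthogonal W, the scales 0.9 and 1.1, and the 50 steps are assumptions chosen for the demo; the diag(σ'(z_i)) factors are ignored here, and since they are bounded they only push the product further toward vanishing:

import numpy as np

def repeated_product_norm(w_scale, steps=50, Dh=8, seed=2):
    """Spectral norm of W^steps - the ||W||^(t-k) factor in the bound."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(Dh, Dh)))   # random orthogonal matrix
    W = w_scale * Q                                  # so ||W|| = w_scale exactly
    J = np.eye(Dh)
    for _ in range(steps):
        J = W @ J                                    # one chain-rule factor per time step
    return np.linalg.norm(J, 2)

print(repeated_product_norm(0.9))   # about 0.005 -> vanishing gradient ("small W")
print(repeated_product_norm(1.1))   # about 117   -> exploding gradient ("large W")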
LSTMs
• Long Short-Term Memory
• Introduced by Hochreiter and Schmidhuber (1997)
• Tackles the vanishing and exploding gradient problems using gating

Taken from Christopher Olah’s blog


LSTM Equations:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
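A hedged numpy sketch of one step implementing the equations above. The sizes, random weights, and dummy inputs are illustrative assumptions; real implementations typically fuse the four gate matrices into a single multiply:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, Dh = 4, 8                                   # input and hidden sizes (assumed)
rng = np.random.default_rng(3)
def init():
    return rng.normal(scale=0.1, size=(Dh, Dh + d)), np.zeros(Dh)
(Wf, bf), (Wi, bi), (Wc, bc), (Wo, bo) = init(), init(), init(), init()

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    C_tilde = np.tanh(Wc @ z + bc)             # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde         # update the memory cell
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

h, C = np.zeros(Dh), np.zeros(Dh)
for x_t in rng.normal(size=(5, d)):            # dummy sequence of 5 inputs
    h, C = lstm_step(x_t, h, C)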
LSTMs
• "Forget gate":
  f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
LSTMs
• "Input gate layer":
  i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
  C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
LSTMs
• Updating the memory cell:
  C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
• The dependence on earlier cell states is no longer an exponential power of W – unrolling gives
  C_t = f_t ⊙ f_{t−1} ⊙ f_{t−2} ⊙ C_{t−3} + ... (gated input terms)
• Information can flow: with f_t = 1,
  C_t = C_{t−1} + i_t ⊙ C̃_t
  (a small check follows below)
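A tiny check of the "information can flow" claim: with the forget gate held at 1 and the input gate at 0, the cell state is copied unchanged across many steps, and ∂C_t/∂C_{t−k} is a product of forget gates rather than a power of W. The numbers below are toy assumptions:

import numpy as np

rng = np.random.default_rng(4)
Dh, T = 4, 100

C = rng.normal(size=Dh)                 # information stored in the cell at t = 0
C0 = C.copy()
for t in range(T):
    f_t = np.ones(Dh)                   # forget gate fully open
    i_t = np.zeros(Dh)                  # nothing new written
    C_tilde = rng.normal(size=Dh)
    C = f_t * C + i_t * C_tilde         # C_t = f_t * C_{t-1} + i_t * C_tilde_t

print(np.allclose(C, C0))               # True: the information survived 100 steps
# dC_t/dC_0 is the elementwise product of the forget gates (here exactly 1):
# no repeated multiplication by W, so nothing forces vanishing or exploding.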
LSTMs
• Finally, setting the output:
  o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
  h_t = o_t ⊙ tanh(C_t)
Conclusions
1. RNNs are very powerful
2. RNNs are hard to train
3. Nowadays, gating (LSTMs) is the way to go!

Acknowledgments:
• Andrej Karpathy – http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• Richard Socher – http://cs224d.stanford.edu/
• Christopher Olah – http://colah.github.io/posts/2015-08-Understanding-LSTMs/
