
CMSC498L

Recurrent Neural Networks


Sweta Agrawal

Slides Adapted from CS498


Sentiment classification

• “The food was really good”

[Figure: an RNN reads "The", "food", "was", "really", "good" one word at a time, producing hidden states h1 … h5; the final state h5 feeds the classifier.]


Image Caption Generation

“The dog is hiding”


Machine Translation

https://translate.google.com/
What makes Recurrent Networks so special?

Operation over sequences of vectors

Slide Credits: http://karpathy.github.io/2015/05/21/rnn-effectiveness/


How do RNNs work?

Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
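
The recurrence itself is short enough to write out. A minimal NumPy sketch of a vanilla RNN step (all weight names and dimensions here are illustrative, not from the slides): the hidden state is updated as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: combine the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy dimensions: 4-dim inputs, 3-dim hidden state (arbitrary choices).
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(3, 4))
W_hh = rng.normal(scale=0.1, size=(3, 3))
b_h  = np.zeros(3)

h = np.zeros(3)                      # h_0
for x_t in rng.normal(size=(5, 4)):  # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                             # final hidden state summarizes the sequence
```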
Backpropagation Through Time and Vanishing Gradients

Source:
http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
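
A rough numerical illustration of the problem: backpropagation through time multiplies the gradient by the recurrent Jacobian at every step, so over T steps its magnitude behaves roughly like the largest singular value raised to the T-th power. The matrices below are arbitrary stand-ins, chosen only to show the effect.

```python
import numpy as np

# Backprop through T steps multiplies the gradient by the recurrent Jacobian
# at every step; if its largest singular value is < 1 the gradient shrinks
# exponentially, if > 1 it blows up.
for scale in (0.5, 1.5):
    W = scale * np.eye(3)          # stand-in for dh_t / dh_{t-1}
    grad = np.ones(3)
    for _ in range(50):            # 50 time steps
        grad = W.T @ grad
    print(scale, np.linalg.norm(grad))
# scale 0.5 -> ~1e-15 (vanishing), scale 1.5 -> ~1e9 (exploding)
```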
The Problem of Long-Term Dependencies

[Figure panels: a short-term dependency (relevant context a few steps back) vs. a long-term dependency (relevant context many steps back).]

Slides: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks

Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
Gated Recurrent Networks

Slides: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM) Networks

Slides: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
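
A minimal NumPy sketch of one LSTM step, following the standard gate equations described in the linked post (all parameter names and the stacked-weight layout are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the input/forget/output/candidate
    parameters stacked along the first axis (4*H rows)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0*H:1*H])     # input gate: how much new information to write
    f = sigmoid(z[1*H:2*H])     # forget gate: how much old cell state to keep
    o = sigmoid(z[2*H:3*H])     # output gate: how much cell state to expose
    g = np.tanh(z[3*H:4*H])     # candidate cell update
    c = f * c_prev + i * g      # cell state: additive update helps gradients flow
    h = o * np.tanh(c)          # hidden state passed to the next step / output
    return h, c
```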
Use Cases

- Multiple inputs, single output – Sequence Classification
- Single input, multiple outputs – Image Captioning
- Multiple inputs, multiple outputs – Image Captioning
- Multiple inputs, multiple outputs – Translation


Sequence Classification

[Figure: the RNN reads "The", "food", …, "good", producing hidden states h1, h2, …, hn; the intermediate outputs are ignored and only the final state hn goes to a linear classifier.]
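
A sketch of this setup (parameter names are illustrative): run the RNN over the word vectors, discard the intermediate states, and apply a linear classifier to the final hidden state.

```python
import numpy as np

def classify_last_state(word_vectors, params):
    """Run the RNN over the sequence and classify from the final hidden state."""
    W_xh, W_hh, b_h, W_out, b_out = params
    h = np.zeros(W_hh.shape[0])
    for x_t in word_vectors:                 # h1 ... hn; earlier states are discarded
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return W_out @ h + b_out                 # linear classifier on hn only
```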


Sequence Classification
[Figure: the RNN reads "The", "food", …, "good", producing hidden states h1, h2, …, hn; the states are pooled, h = Sum(h1, …, hn), and h goes to a linear classifier.]

http://deeplearning.net/tutorial/lstm.html
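
A sketch of the pooled variant, using sum pooling to match the h = Sum(…) on the slide (parameter names are illustrative):

```python
import numpy as np

def classify_pooled_states(word_vectors, params):
    """Pool every hidden state, then classify the pooled vector."""
    W_xh, W_hh, b_h, W_out, b_out = params
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in word_vectors:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    pooled = np.sum(states, axis=0)          # h = Sum(h1, ..., hn); mean is a common alternative
    return W_out @ pooled + b_out
```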
Image Caption Generation
[Figure: a CNN encodes the image into the initial state h0; at each step the RNN consumes the previous word (START, "The", "dog", "is", "hiding"), and a classifier over the hidden states h1 … h5 predicts the next word ("The", "dog", "is", "hiding", STOP).]
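
A hedged sketch of the greedy decoding loop above; the CNN feature extractor, vocabulary, embedding matrix E, and all parameter names are assumptions for illustration, not given in the slides.

```python
import numpy as np

def greedy_caption(cnn_features, params, vocab, max_len=20):
    """Decode a caption one word at a time, feeding each prediction back in."""
    W_xh, W_hh, b_h, W_out, b_out, E = params   # E: word embedding matrix
    h = cnn_features                            # h0 comes from the CNN
    word = vocab["START"]
    caption = []
    for _ in range(max_len):
        x_t = E[word]                           # embed the previous word
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        logits = W_out @ h + b_out
        word = int(np.argmax(logits))           # greedy: pick the most likely next word
        if word == vocab["STOP"]:
            break
        caption.append(word)
    return caption
```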
Language Model and Sequence Generation
What is language modelling?
How likely is it to generate a given text?

I am going home.

I am going house.

P(“I”, “am”, “going”, “home”) > P(“I”, “am”, “going”, “house”)

Slides: http://users.umiacs.umd.edu/~jbg/teaching/CMSC_723/06a_lm_intro.pdf
Estimating P(w1, w2, …, wn)
Chain Rule:

P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, …, wn-1)

Markov Assumption (condition only on the previous k words):

P(w1, w2, …, wn) ≈ Πi P(wi | wi-1, wi-2, …, wi-k)
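
To make the factorization concrete, here is a toy bigram (k = 1) estimate from counts; the corpus is made up.

```python
from collections import Counter

# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}), estimated from a toy corpus.
corpus = "i am going home . i am going out . i am home .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_prob(words):
    p = unigrams[words[0]] / len(corpus)                 # P(w_1)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]       # P(w_i | w_{i-1})
    return p

print(sentence_prob("i am going home".split()))   # > 0
print(sentence_prob("i am going house".split()))  # 0 under this tiny corpus
```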


Neural Language Modelling: Character RNN

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Example
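
A rough forward-pass sketch of a character-level language model in the spirit of the linked post: each step one-hot encodes the current character, and a softmax over the hidden state scores the next character. The training loop is omitted and all names are illustrative.

```python
import numpy as np

# Character-level language model: predict the next character from the current
# one plus the hidden state. Forward pass only (no training loop).
text = "hello hello hello"
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V, H = len(chars), 16

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(V, H))
b_h, b_y = np.zeros(H), np.zeros(V)

h = np.zeros(H)
for cur, nxt in zip(text, text[1:]):
    x = np.zeros(V); x[idx[cur]] = 1.0            # one-hot current character
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    logits = W_hy @ h + b_y
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over next characters
    loss = -np.log(probs[idx[nxt]])               # cross-entropy; training would minimize this
```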
Multi-layer RNNs
• We can of course design RNNs with multiple hidden layers (see the sketch below)
• Anything goes: skip connections across layers, across time, …

[Figure: a stacked RNN unrolled over inputs x1 … x6, producing outputs y1 … y6.]
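
A minimal sketch of one time step of a stacked RNN, where each layer's hidden state is the input to the layer above (names are illustrative):

```python
import numpy as np

def stacked_rnn_step(x_t, h_prev, layers):
    """One time step of a stacked RNN: each layer's output feeds the next layer.
    `layers` is a list of (W_xh, W_hh, b_h) tuples, `h_prev` the per-layer states."""
    h_new = []
    inp = x_t
    for (W_xh, W_hh, b_h), h_l in zip(layers, h_prev):
        h_l = np.tanh(W_xh @ inp + W_hh @ h_l + b_h)
        h_new.append(h_l)
        inp = h_l                  # layer l's state is layer l+1's input
    return h_new                   # h_new[-1] feeds the output layer y_t
```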

Word Representation

Distributional Hypothesis: Words that occur in the same contexts tend to have
similar meanings

Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Word-based Co-occurrence Matrix

- Increases in size with vocabulary
- Very high dimensional
- Sparsity issues

Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
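
A small sketch of building a window-based word–word co-occurrence matrix; the toy corpus and window size are my own choices for illustration.

```python
import numpy as np

# Window-based co-occurrence counts on a toy corpus (window of 1 on each side).
corpus = "i like deep learning . i like nlp . i enjoy flying .".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

window = 1
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            M[idx[w], idx[corpus[j]]] += 1

print(vocab)
print(M)   # M grows as |V| x |V| and is mostly zeros -> high dimensional, sparse
```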
Solution: Low dimensional Vectors
Idea: Store “most” of the important information in a fixed, small number of dimensions

Instead of capturing word co-occurrence counts directly, predict the surrounding words of every word

Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Skip-Gram model
● Represent each word as a d-dimensional vector → W
● Represent each context as a d-dimensional vector → V
● Initialize with random weights

Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Skip-Gram model
● Generate probabilities for observing the surrounding (context) words given the center word
● The generated probability vector should match the true context-word probabilities

Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Skip-Gram model

Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
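
A hedged sketch of the skip-gram scoring step: with word (center) vectors W and context vectors V as on the slides, the probability of a context word given a center word is a softmax over dot products (everything else here is illustrative).

```python
import numpy as np

# Skip-gram scoring: P(context c | center w) = softmax over V @ W[w].
vocab_size, d = 1000, 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(vocab_size, d))   # word (center) vectors
V = rng.normal(scale=0.01, size=(vocab_size, d))   # context vectors

def context_probs(center_id):
    scores = V @ W[center_id]          # dot product with every context vector
    scores -= scores.max()             # numerical stability
    p = np.exp(scores)
    return p / p.sum()                 # softmax over the whole vocabulary

p = context_probs(42)
# Training nudges W and V so that p is high for words actually observed
# in the window around word 42 (matching the true context distribution).
```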
Negative Sampling
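
The softmax above normalizes over the entire vocabulary, which is expensive; negative sampling replaces it with a binary objective over the observed (center, context) pair and a few randomly sampled "negative" contexts. A sketch, with all names illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(W, V, center_id, context_id, negative_ids):
    """Push the true pair's score up and k sampled negative pairs' scores down,
    instead of normalizing over the whole vocabulary."""
    w = W[center_id]
    pos = np.log(sigmoid(V[context_id] @ w))                       # observed pair
    neg = sum(np.log(sigmoid(-V[n] @ w)) for n in negative_ids)    # sampled negatives
    return -(pos + neg)                                            # minimize this loss
```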
