[Figure: motivating applications. An RNN reads a document into hidden states h1 … h5; the final state h5 feeds a classifier that labels the document (e.g., Professional vs. Culture documents). Machine translation is another motivating example: https://translate.google.com/]
What makes Recurrent Networks so special?
Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
How do RNNs work?
Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
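To make the recurrence concrete, here is a minimal sketch of one step of a vanilla (Elman) RNN in numpy; the tanh nonlinearity, weight names, and dimensions are illustrative assumptions, not taken from the source.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state mixes
    the current input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Illustrative dimensions: 8-dim inputs, 16-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 16))
W_hh = rng.normal(scale=0.1, size=(16, 16))
b_h = np.zeros(16)

h = np.zeros(16)                            # h0
for x_t in rng.normal(size=(5, 8)):         # a length-5 input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)   # h1 ... h5
```

The same weights (W_xh, W_hh) are reused at every time step; that weight sharing is what makes the network "recurrent".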
Backpropagation Through Time and Vanishing Gradients
Source: http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
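A small numerical sketch of why gradients vanish: backpropagating through time multiplies the gradient by the recurrent Jacobian once per step, so with a small recurrent matrix its norm shrinks geometrically (ignoring the tanh derivative, which only shrinks it further). The matrix and sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.05, size=(16, 16))  # small random matrix, spectral radius well below 1

grad = np.ones(16)  # gradient arriving at the last time step
for t in range(50):
    # Each step back in time multiplies the gradient by W_hh^T
    # (the tanh' factor is dropped here; it is <= 1 and only shrinks it more).
    grad = W_hh.T @ grad
    if t % 10 == 9:
        print(f"after {t + 1:2d} steps back: ||grad|| = {np.linalg.norm(grad):.2e}")
```

With a recurrent matrix whose spectral radius exceeds 1, the same loop would show the opposite failure: exploding gradients.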
The Problem of Long-Term Dependencies
Slides: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks
Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
Gated Recurrent Networks
Slides: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM) Networks
Slides: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
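A minimal sketch of a single LSTM step following the standard gate equations described in Olah's post linked above; the stacked weight layout, names, and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four gate
    pre-activations stacked together; b is the matching bias."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)    # forget / input / output gates
    g = np.tanh(g)                                  # candidate cell values
    c = f * c_prev + i * g                          # cell state: gated memory
    h = o * np.tanh(c)                              # hidden state / output
    return h, c

rng = np.random.default_rng(2)
d_in, d_h = 8, 16
W = rng.normal(scale=0.1, size=(d_h + d_in, 4 * d_h))
b = np.zeros(4 * d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The additive update of the cell state c (rather than repeated matrix multiplication) is what lets gradients flow over long spans, addressing the vanishing-gradient problem above.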
Use Cases
[Figure: two sequence-classification setups. Left: intermediate hidden states h1 … hn-1 are ignored and only the final state hn feeds a linear classifier. Right: all hidden states are pooled, h = Sum(h1 … hn), before classification.]
http://deeplearning.net/tutorial/lstm.html
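A sketch of the two pooling choices in the figure: classify from the final hidden state alone, or from the sum of all hidden states. The hidden states here are random stand-ins for RNN outputs, and the linear classifier weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, n_classes = 16, 2
hs = rng.normal(size=(10, d_h))   # stand-in for h1 ... hn from an RNN
W_out = rng.normal(scale=0.1, size=(d_h, n_classes))
b_out = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Option 1: ignore h1 ... h(n-1), classify from the final state hn.
p_last = softmax(hs[-1] @ W_out + b_out)

# Option 2: pool first, h = Sum(h1 ... hn), then classify.
p_sum = softmax(hs.sum(axis=0) @ W_out + b_out)
```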
Image Caption Generation
[Figure: a CNN encodes the image into the initial hidden state h0; the RNN then emits the caption one word at a time, “The” “dog” “is” “hiding” STOP, feeding each generated word back in as the next input, starting from a START token.]
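A sketch of the generation loop in the figure, assuming greedy decoding: the image feature initializes the hidden state, and the RNN feeds each emitted word back in until STOP. The tiny vocabulary, random weights, and step function are all illustrative stand-ins (the model is untrained, so the output is meaningless).

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["START", "STOP", "The", "dog", "is", "hiding"]
d_h = 16
E = rng.normal(size=(len(vocab), d_h))          # word embeddings
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_out = rng.normal(size=(d_h, len(vocab)))

h = rng.normal(size=d_h)        # h0: stand-in for the CNN image encoding
word = vocab.index("START")
caption = []
for _ in range(20):             # hard cap on caption length
    h = np.tanh(E[word] + h @ W_hh)      # consume the previous word
    word = int(np.argmax(h @ W_out))     # greedy: pick the most likely next word
    if vocab[word] == "STOP":
        break
    caption.append(vocab[word])
print(" ".join(caption))
```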
Language Model and Sequence Generation
What is language modelling?
How likely is a given text to be generated? A good model assigns higher probability to fluent text:
P(I am going home.) > P(I am going house.)
Slides: http://users.umiacs.umd.edu/~jbg/teaching/CMSC_723/06a_lm_intro.pdf
Estimating P(w1, w2, …, wn)
Chain Rule:
P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) ⋯ P(wn | w1, …, wn-1)
Markov Assumption: condition each word on only the previous k words; e.g., for k = 1 (a bigram model),
P(wn | w1, …, wn-1) ≈ P(wn | wn-1)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
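A minimal sketch of the chain-rule factorization under the bigram Markov assumption, using maximum-likelihood counts from a toy corpus (invented for illustration; no smoothing):

```python
from collections import Counter

corpus = "i am going home . i am home . i am going out .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    """P(w1..wn) ~= P(w1) * prod_i P(wi | w(i-1)) under the Markov assumption."""
    words = sentence.split()
    p = unigrams[words[0]] / len(corpus)
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0                  # unseen word: zero probability without smoothing
        p *= bigrams[(prev, cur)] / unigrams[prev]   # MLE estimate of P(cur | prev)
    return p

print(bigram_prob("i am going home ."))   # > 0: every bigram was observed
print(bigram_prob("i am going house ."))  # 0: unseen bigram, would need smoothing
```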
Example: character-level text generation (from Karpathy's post above)
Multi-layer RNNs
• We can of course design RNNs with multiple hidden layers (a minimal sketch follows below).
[Figure: a stacked RNN unrolled over six time steps; inputs x1 … x6 at the bottom, outputs y1 … y6 at the top.]
• Anything goes: skip connections across layers, across time, …
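A minimal sketch of stacking, assuming each layer's hidden-state sequence becomes the next layer's input sequence; the two-layer depth, shapes, and weights are illustrative.

```python
import numpy as np

def rnn_layer(xs, W_xh, W_hh):
    """Run one vanilla RNN layer over a whole sequence,
    returning the hidden state at every time step."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(5)
xs = rng.normal(size=(6, 8))                      # x1 ... x6
layer1 = rnn_layer(xs, rng.normal(scale=0.1, size=(8, 16)),
                       rng.normal(scale=0.1, size=(16, 16)))
# Layer 2 reads layer 1's hidden states as its input sequence.
layer2 = rnn_layer(layer1, rng.normal(scale=0.1, size=(16, 16)),
                           rng.normal(scale=0.1, size=(16, 16)))
ys = layer2                                       # y1 ... y6 (top layer)
```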
Word Representation
Distributional Hypothesis: Words that occur in the same contexts tend to have
similar meanings
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Word-based Co-occurrence Matrix
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
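A sketch of building the co-occurrence matrix with a symmetric window of size 1, over a toy three-sentence corpus in the style of the linked CS224d lecture:

```python
import numpy as np

corpus = "i like deep learning . i like nlp . i enjoy flying .".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

window = 1                                  # symmetric context window
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

print(vocab)
print(C)                                    # C[a, b] = times word b appears near word a
```

The matrix grows as vocab_size², and most entries are zero, which motivates the low-dimensional vectors on the next slide.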
Solution: Low-dimensional Vectors
Idea: Store “most” of the important information in a fixed, small number of
dimensions
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
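One classic way to get such vectors, sketched below: truncated SVD of the co-occurrence matrix, keeping only the top-k singular directions. The matrix here is a random stand-in and k = 2 is illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
C = rng.poisson(1.0, size=(50, 50)).astype(float)  # stand-in co-occurrence matrix

U, S, Vt = np.linalg.svd(C)
k = 2                                   # keep a fixed, small number of dimensions
word_vectors = U[:, :k] * S[:k]         # each row: a k-dim word embedding

print(word_vectors.shape)               # (50, 2)
```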
Skip-Gram model
● Represent each word as a d-dimensional vector → W
● Represent each context as a d-dimensional vector → V
● Initialize both with random weights
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Skip-Gram model
● Generate probabilities for observing the surrounding (context) words given a center word
● The generated probability vector should match the true, empirically observed probabilities
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
Skip-Gram model
Slides: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf
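Pulling the three slides together, a sketch of the skip-gram forward pass: word vectors W, context vectors V, random initialization, and a softmax over the vocabulary giving P(context word | center word). Vocabulary size and dimension d are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
vocab_size, d = 1000, 50
W = rng.normal(scale=0.01, size=(vocab_size, d))   # word (center) vectors
V = rng.normal(scale=0.01, size=(vocab_size, d))   # context vectors

def context_probs(center):
    """P(context word | center word) for every word in the vocabulary."""
    scores = V @ W[center]              # one dot product per candidate context word
    scores -= scores.max()              # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = context_probs(center=42)
# Training pushes p toward the empirical context distribution,
# e.g., via cross-entropy loss on observed (center, context) pairs.
```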
Negative Sampling
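The softmax above sums over the entire vocabulary, which is expensive; negative sampling instead trains a binary discriminator between one observed (center, context) pair and k randomly drawn "negative" words. A minimal sketch of the per-pair loss, with uniform negative sampling standing in for the unigram-based distribution used in practice:

```python
import numpy as np

rng = np.random.default_rng(8)
vocab_size, d, k = 1000, 50, 5
W = rng.normal(scale=0.01, size=(vocab_size, d))   # word (center) vectors
V = rng.normal(scale=0.01, size=(vocab_size, d))   # context vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(center, context):
    """-log sigma(v_context . w_center) - sum_k log sigma(-v_neg . w_center)"""
    negatives = rng.integers(0, vocab_size, size=k)  # uniform here; unigram^0.75 in practice
    pos = np.log(sigmoid(V[context] @ W[center]))    # observed pair should score high
    neg = np.log(sigmoid(-(V[negatives] @ W[center]))).sum()  # negatives should score low
    return -(pos + neg)

print(neg_sampling_loss(center=42, context=7))
```

Each update now touches only k + 1 context vectors instead of all vocab_size of them.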