RNNs
LSTMs
Bibliography
If training vanilla neural nets is optimization over functions,
training recurrent nets is optimization over programs.
– Andrej Karpathy
Recurrent Neural Networks (RNNs)
Problem Statement
We need an architecture to properly capture and deal with temporal,
sequential data!
RNNs
[Figure: an RNN cell with input x and output y, unrolled over time into inputs x_0, x_1, ..., x_t and outputs y_0, y_1, ..., y_t]
RNNs Stacking
As usual we stack several layers
[Figure: several stacked recurrent layers unrolled over inputs x_0, ..., x_t and outputs y_0, ..., y_t]
RNNs
Main ideas:
Process data sequentially
Introduce a hidden state vector h_t to keep an internal memory
Define the activation functions and architecture as follows:
h_t = f_W(h_{t-1}, x_t), e.g.
h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
y_t = W_{hy} h_t, or
y_t = \mathrm{softmax}(W_{hy} h_t)
It can be proven that RNNs are Turing-complete, i.e. they can simulate
arbitrary programs.
Based on ideas of David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams [RHW86] and John Hopfield [Hop82].
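As a minimal illustration of the recurrence above, here is a hedged NumPy sketch of one vanilla RNN step unrolled over a toy sequence; the dimensions, the random weights, and the function names are assumptions for illustration, not the lecture's code:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    """One vanilla RNN step: update the hidden state and emit an output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y_t = W_hy @ h_t                           # y_t = W_hy h_t (optionally followed by softmax)
    return h_t, y_t

# Toy dimensions and random weights (illustrative only).
rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3
W_hh = 0.1 * rng.normal(size=(d_hid, d_hid))
W_xh = 0.1 * rng.normal(size=(d_hid, d_in))
W_hy = 0.1 * rng.normal(size=(d_out, d_hid))

h = np.zeros(d_hid)                       # the internal memory h_t
for x_t in rng.normal(size=(5, d_in)):    # process a length-5 sequence step by step
    h, y = rnn_step(h, x_t, W_hh, W_xh, W_hy)
```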
RNN Use Cases
RNN Example
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNN Example
Shakespeare Simulation
KING LEAR: O, if you were a feeble sight, the courtesy of
your law, Your sight and several breath, will wear the gods With
his heads, and my hands are wonder’d at the deeds, So drop
upon your lordship’s head, and your opinion Shall be against
your honour.
Source:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
RNN Training
Backpropagation through time:
1. Run forward through the entire sequence to compute the loss,
2. then run backward through the entire sequence to compute the gradient.
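A hedged sketch of this forward-then-backward pattern on a single toy sequence, here using PyTorch's nn.RNN; the model size, the random data, and the mean-squared-error loss are illustrative assumptions:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 3)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

x = torch.randn(1, 20, 4)        # one toy sequence of length 20
target = torch.randn(1, 20, 3)   # a target for every time step

out, _ = rnn(x)                                   # 1. forward through the entire sequence
loss = nn.functional.mse_loss(head(out), target)  # loss accumulated over all time steps
loss.backward()                                   # 2. backward through the entire sequence (BPTT)
opt.step()
opt.zero_grad()
```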
RNN Training
[Figure: the unrolled RNN over x_0, ..., x_t and y_0, ..., y_t; the forward pass accumulates the loss over all time steps, the backward pass propagates gradients back through every step]
RNN Training Problem
Vanishing / exploding gradient: backpropagating from h_t to h_{t-1} multiplies the gradient by the tanh derivative and by W^T; from h_{t-1} to h_{t-2} it is multiplied by the same factors again, and so on.
RNN Training Problem
Therefore, we have
h_t = \tanh\left(W \tanh\left(W \left(\dots \tanh\left(W \begin{pmatrix} h_0 \\ x_1 \end{pmatrix}\right)\right)\right)\right)
Problem Statement
We need an architecture to properly take care of the vanishing/exploding
gradient problem!
Long Short Term Memory (LSTMs)
RNN Architecture:
h_t = \tanh\left(W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}\right)
LSTM Architecture:
\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)
For LSTMs we introduce four gates
1. i: Input gate - whether to write to cell
2. f: Forget gate - whether to erase cell
3. o: Output gate - how much to output
4. g: Gate “the” gate - content
and the memory cell c_t and the state h_t.
Introduced by Hochreiter and Schmidhuber in 1997, see [HS97].
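A minimal NumPy sketch of one LSTM step following the gate equations above; the stacked weight matrix W, the gate ordering, and the toy dimensions are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W):
    """One LSTM step: W maps [h_{t-1}; x_t] to the stacked pre-activations of i, f, o, g."""
    z = W @ np.concatenate([h_prev, x_t])
    d = h_prev.shape[0]
    i = sigmoid(z[0*d:1*d])     # input gate: whether to write to the cell
    f = sigmoid(z[1*d:2*d])     # forget gate: whether to erase the cell
    o = sigmoid(z[2*d:3*d])     # output gate: how much to output
    g = np.tanh(z[3*d:4*d])     # candidate cell content
    c_t = f * c_prev + i * g    # c_t = f * c_{t-1} + i * g (element-wise)
    h_t = o * np.tanh(c_t)      # h_t = o * tanh(c_t)
    return h_t, c_t

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d_hid, d_in = 8, 4
W = 0.1 * rng.normal(size=(4 * d_hid, d_hid + d_in))
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(h, c, rng.normal(size=d_in), W)
```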
LSTM Architecture
[Figure: the LSTM cell unrolled over x_{t-1}, x_t, x_{t+1} and h_{t-1}, h_t, h_{t+1}, showing the σ, σ, tanh, σ gate layers, the element-wise multiplications and addition, and the cell-state path]
LSTM Backpropagation
Takeaway
Think of LSTMs as RNNs on steroids, avoiding the vanishing /
exploding gradient problem.
LSTM Example
Human Activity Recognition (HAR) with Channel State Information (CSI)
from Radio [DHKS20]:
[Figure: (a) CSI values for Sitting Down, (b) CSI values for Fall]
GRU Architecture
[Figures from an image-captioning attention example: examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word), and examples of mistakes where attention gives intuition into what the model saw]
Attention Principle
Problem
While this (compressing the whole input sequence into a single hidden state) works well for short sequences (t < 20), it fails for long
sequences.
Attention Principle
z_i = \sum_j \alpha_{ij} h_j, \quad \text{where}
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})} \quad \text{(softmax)}
e_i = f(y_{i-1}, h)
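The following NumPy sketch illustrates this weighted sum for a single decoder step; note that the scoring function f is replaced by a plain dot-product score here, which is an assumption for illustration rather than the exact f from the slide:

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    return np.exp(e) / np.exp(e).sum()

def attend(query, H):
    """Compute z = sum_j alpha_j h_j with alpha = softmax(scores)."""
    scores = H @ query       # e_j: one score per encoder state h_j (dot-product stand-in for f)
    alpha = softmax(scores)  # attention weights
    return alpha @ H, alpha  # context vector z and the weights

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))  # six encoder hidden states h_j
z, alpha = attend(rng.normal(size=8), H)
```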
Attention Heatmap
Source: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Transformers
Transformers
Key concepts:
1. Idea: Use attention instead of an internal state vector h_t. This allows for
unbounded sequences and also enables parallelism.
2. Use an embedding to store sequences as sets in an embedding vector space.
3. Use a positional embedding to keep track of the otherwise lost order
information.
4. Use the Query, Key, Value (Q, K, V) concept, similar to information
retrieval systems.
5. Multi-heads for selective, parallel attention.
6. Masking to ensure causality.
We will explore these concepts in detail below.
Attention
Embedding
[Figure: attention alignments between an English sentence and its French translation, panels (a) and (b)]
Source: [BCB15]
Remark: Note the reversed word order (“la zone économique européenne” vs.
“the European economic area”)!
Positional Embedding
\mathrm{PE}(t)_{2k} = \sin(\omega_k t), \qquad \mathrm{PE}(t)_{2k+1} = \cos(\omega_k t), \qquad \text{where} \quad \omega_k := \frac{1}{10000^{2k/d}}
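A short NumPy sketch of these sinusoidal positional embeddings, following the sin/cos form above; the dimensions are illustrative assumptions:

```python
import numpy as np

def positional_embedding(max_len, d):
    """PE[t, 2k] = sin(w_k t), PE[t, 2k+1] = cos(w_k t) with w_k = 1 / 10000^(2k/d)."""
    t = np.arange(max_len)[:, None]        # positions 0 .. max_len-1
    k = np.arange(d // 2)[None, :]         # frequency indices
    omega = 1.0 / (10000 ** (2 * k / d))   # w_k
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(omega * t)        # even dimensions
    pe[:, 1::2] = np.cos(omega * t)        # odd dimensions
    return pe

pe = positional_embedding(max_len=50, d=16)  # one row per position, added to the token embeddings
```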
Positional Embedding
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
Query, Key, Value (Q, K , V ) concept
We introduce three weight matrices W_Q, W_K, and W_V and reuse X three times:
Q := X W_Q
K := X W_K
V := X W_V
\mathrm{attention}(Q, K, V) := \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V.
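A compact NumPy sketch of this scaled dot-product attention; X, the weight matrices, and the dimensions are random placeholders for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, row-wise over the keys."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                      # 5 tokens, model dimension 16
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) for _ in range(3))
Z = attention(X @ W_Q, X @ W_K, X @ W_V)          # reuse X three times, as above
```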
Multi Heads
\mathrm{head}_i := \mathrm{attention}(Q_i, K_i, V_i),
and
\text{multi-head} := \mathrm{concatenate}(\mathrm{head}_1, \dots, \mathrm{head}_n)
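A hedged sketch of multi-head attention: the projections are split into n heads, each head attends independently, and the results are concatenated. The head count and dimensions are illustrative, and many implementations also apply an output projection afterwards, which is omitted here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head(X, W_Q, W_K, W_V, n_heads):
    """head_i = attention(Q_i, K_i, V_i); multi-head = concatenate(head_1, ..., head_n)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = [attention(Qi, Ki, Vi)
             for Qi, Ki, Vi in zip(np.split(Q, n_heads, axis=-1),
                                   np.split(K, n_heads, axis=-1),
                                   np.split(V, n_heads, axis=-1))]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
Z = multi_head(X, W_Q, W_K, W_V, n_heads=4)
```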
Masking to Ensure Causality
Problem:
The decoder is identical to the encoder, except that it is not allowed
to “look ahead”, i.e. we do not want to know the tokens from the future,
but rather to predict them from the past.
Solution:
In the decoder, the self-attention layer is only allowed to attend to
earlier, “past” positions in the output sequence.
This is achieved by masking future positions (setting them to −∞)
before the softmax is applied.
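A brief NumPy sketch of such a causal mask: future positions receive −∞ before the softmax, so their attention weights become zero; the dimensions are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Scaled dot-product attention where position i may only attend to positions j <= i."""
    t = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)          # set future positions to -inf
    return softmax(scores, axis=-1) @ V               # their weights become exactly zero

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))
Z = masked_attention(Q, K, V)  # row i depends only on rows 0..i of V
```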
Transformer Architecture
\mathrm{attention}(Q, K, V) := \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
is equivalent to the update rule for so-called “modern” Hopfield networks,
characterised by their energy function (free energy)
E = -\mathrm{lse}(\beta, X^T \xi) + \frac{1}{2} \xi^T \xi + \beta^{-1} \log N + \frac{1}{2} M^2,
where lse denotes the log-sum-exp function, i.e.
\mathrm{lse}(\beta, x) := \beta^{-1} \log\left(\sum_i \exp(\beta x_i)\right).
References I
References II
References III
[RSL+ 21] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler,
L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff,
D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and
S. Hochreiter, “Hopfield networks is all you need,” 2021.