
Learning From Data

12: Neural Networks – III


Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main



Summer Semester 2022
Content

RNNs

LSTMs

Attention and Transformers

Bibliography

If training vanilla neural nets is optimization over functions,
training recurrent nets is optimization over programs.
– Andrej Karpathy

Recurrent Neural Networks (RNNs)

“Classical” neural network architectures only address non-temporal problems.
Ignoring the time dimension (or sequence) causes problems similar to ignoring the spatial dimension for images.

Problem Statement
We need an architecture to properly capture and deal with temporal,
sequential data!

RNNs

RNN architecture (unrolled):

[Diagram: one RNN cell with input x, hidden state h, and output y, unrolled over time into a chain x0 → h0 → y0, x1 → h1 → y1, . . . , xt → ht → yt, where each hidden state also feeds the next time step.]

Recurrence formula: ht = fW (ht−1 , xt ).

RNNs Stacking
As usual, we stack several layers:

[Diagram: several RNN layers stacked on top of each other, each unrolled over time; the inputs x0 . . . xt enter the bottom layer, each layer's hidden states feed both the next time step and the layer above, and the top layer produces the outputs y0 . . . yt.]

RNNs
Main ideas:
Process data sequentially
Introduce a hidden state vector ht to keep an internal memory
Define activation functions and architecture as follows:

ht = fW (ht−1 , xt ) e.g.
ht = tanh(Whh ht−1 + Wxh xt )
yt = Why ht or
yt = softmax(Why ht )

It can be proven that RNNs are Turing-complete, i.e. they can simulate arbitrary programs.
Based on ideas of David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams [RHW86] and John Hopfield [Hop82].
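
A minimal sketch of this recurrence in Python/NumPy (the dimensions, initialisation, and toy data are illustrative assumptions, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 4          # illustrative sizes

W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hid))

def rnn_step(h_prev, x_t):
    """One application of h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t                   # or softmax(W_hy @ h_t) for classification
    return h_t, y_t

# Unroll over a toy sequence of length 5
h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h, y = rnn_step(h, x)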
RNN Use Cases

Sensor data classification / time series (many to many)


Language Translation (many to many)
Sentiment Classification (many to few or one)
Image Captioning (one to many)

RNN Example

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNN Example

Shakespeare Simulation
KING LEAR: O, if you were a feeble sight, the courtesy of
your law, Your sight and several breath, will wear the gods With
his heads, and my hands are wonder’d at the deeds, So drop
upon your lordship’s head, and your opinion Shall be against
your honour.
Source:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNN Training

Backpropagation:
1. Run forward through the entire sequence to compute the loss,
2. then backward through the entire sequence to compute the gradient (sketched below).
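
A sketch of this two-pass procedure (backpropagation through time) for the simple tanh RNN above, assuming a per-step squared-error loss; all sizes and data are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 8, 16, 4, 5   # illustrative sizes and sequence length

W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hid))

xs = rng.normal(size=(T, d_in))
targets = rng.normal(size=(T, d_out))

# 1. Forward through the entire sequence, caching states, to compute the loss
hs = [np.zeros(d_hid)]
ys, loss = [], 0.0
for x, tgt in zip(xs, targets):
    h = np.tanh(W_hh @ hs[-1] + W_xh @ x)
    y = W_hy @ h
    hs.append(h); ys.append(y)
    loss += 0.5 * np.sum((y - tgt) ** 2)

# 2. Backward through the entire sequence (backpropagation through time)
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
dh_next = np.zeros(d_hid)
for t in reversed(range(T)):
    dy = ys[t] - targets[t]
    dW_hy += np.outer(dy, hs[t + 1])
    dh = W_hy.T @ dy + dh_next            # gradient reaching h_t
    dpre = (1.0 - hs[t + 1] ** 2) * dh    # through the tanh nonlinearity
    dW_xh += np.outer(dpre, xs[t])
    dW_hh += np.outer(dpre, hs[t])
    dh_next = W_hh.T @ dpre               # passed on to the previous time step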

RNN Training
[Diagram: the stacked, unrolled RNN from before, annotated with the loss; the forward pass runs left to right through the whole sequence x0 . . . xt to compute the loss, and the backward pass runs right to left through the whole sequence to compute the gradients.]

RNN Training Problem
Vanishing / Exploding Gradient:

ht = tanh(Whh ht−1 + Wxh xt)
   = tanh( [Whh Wxh] (ht−1 ; xt) )
   = tanh( W (ht−1 ; xt) ),

where W := [Whh Wxh] and (ht−1 ; xt) denotes the stacked vector.

Thus, if we repeat (suppressing the inputs xt for readability), we get

ht = tanh( W tanh( W (ht−2 ; xt−1) ) )

and so on.
RNN Training Problem

Therefore, we have
ht = tanh( W tanh( W ( . . . tanh( W (h0 ; x1) ) . . . ) ) )

The many, repeated multiplications with W will blow up the gradient or reduce it to zero!
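
A tiny numerical illustration of this effect (matrix size, scales, and number of steps are arbitrary assumptions; the tanh derivative, which only shrinks gradients further, is omitted):

import numpy as np

rng = np.random.default_rng(0)
d = 16

for scale in (0.05, 0.5):                      # small vs. large recurrent weights
    W = rng.normal(scale=scale, size=(d, d))
    grad = np.eye(d)                           # stand-in for the gradient at step t
    for _ in range(50):                        # propagate back over 50 time steps
        grad = grad @ W                        # repeated multiplication with W
    print(scale, np.linalg.norm(grad))         # tends to vanish for small W, explode for large W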

Problem Statement
We need an architecture to properly take care of the vanishing/exploding
gradient problem!

Long Short Term Memory (LSTMs)
RNN Architecture:

ht = tanh( W (ht−1 ; xt) )

LSTM Architecture:

( it ; ft ; ot ; gt ) = ( σ ; σ ; σ ; tanh )( W (ht−1 ; xt) )
ct = ft ⊙ ct−1 + it ⊙ gt
ht = ot ⊙ tanh(ct)

Here each nonlinearity acts on its block of the pre-activation, and ⊙ denotes the element-wise (Hadamard) product.
For LSTMs we introduce four gates
1. i: Input gate - whether to write to cell
2. f: Forget gate - whether to erase cell
3. o: Output gate - how much to output
4. g: Gate “the” gate - content
and the memory cell ct and state ht .
Introduced by Hochreiter and Schmidhuber in 1997, see [HS97].
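
A minimal sketch of one LSTM step following the gate equations above (sizes, initialisation, and the single stacked weight matrix are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                       # illustrative sizes

# One big weight matrix producing the four gate pre-activations at once
W = rng.normal(scale=0.1, size=(4 * d_hid, d_hid + d_in))

def lstm_step(h_prev, c_prev, x_t):
    z = W @ np.concatenate([h_prev, x_t])
    i, f, o = (sigmoid(z[k * d_hid:(k + 1) * d_hid]) for k in range(3))
    g = np.tanh(z[3 * d_hid:])
    c_t = f * c_prev + i * g              # element-wise cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h = c = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(h, c, x)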
LSTM Architecture

[Diagram: a chain of LSTM cells; each cell receives xt, ht−1, and ct−1, computes the three σ gates and the tanh candidate, updates the cell state via element-wise multiplication and addition, and passes ht and ct on to the next cell.]

LSTM Backpropagation

LSTMs avoid the backpropagation problem because the cell state ct is only updated element-wise; no repeated matrix multiplication with W occurs along the cell-state path.
For details, see [HS97].

Takeaway
Think of LSTMs as RNNs on steroids that avoid the vanishing / exploding gradient problem.

LSTM Example
Human Activity Recognition (HAR) with Channel State Information (CSI)
from Radio [DHKS20]:

[Figure: CSI values for (a) Sitting Down and (b) Fall]


Gated Recurrent Units (GRU)

GRU Architecture

zt = σ(Wz xt + Uz ht−1)
rt = σ(Wr xt + Ur ht−1)
h̃t = φ(Wh xt + Uh (rt ⊙ ht−1))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

Think of a GRU as a simplified LSTM, as it lacks an output gate. GRUs were introduced by Cho et al. [CvMG+ 14].
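
A minimal sketch of one GRU step following the equations above (sizes and initialisation are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                                    # illustrative sizes

Wz, Uz = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wr, Ur = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wh, Uh = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)                # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)                # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))    # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(h, x)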


Focusing Learning by Attention


In 2015, Xu et al. applied the idea of attention (which was in the air at the time anyway) to RNNs for image captioning [XBK+ 15]:

[Figure 1 from the paper: the model learns a word/image alignment; the visualized attentional maps are explained in sections 3.1 and 5.4 of [XBK+ 15].]

Source: [XBK+ 15]


Attention Example

Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)

Source: [XBK+ 15]
Attention Failure

Figure 5. Examples of mistakes where we can use attention to gain intuition into what the model saw.

Source: [XBK+ 15]


Attention Principle

[Diagram comparing connectivity patterns: Fully Connected, Convolution, Local Attention, Global Attention.]

Attention Principle

Before attention and transformers, Sequence-to-Sequence (Seq2Seq) architectures using RNNs were used:

x = {x1 , . . . , xt } → Encoder → Decoder → y = {y1 , . . . , yt′ }

Problem
While this works well for short sequences (t < 20), it fails for long
sequences.

Attention Principle

We want to weight the sensitivities of the network to the input based on memory from previous inputs.
This we call “attention”.
Mathematically,

zi = Σj αij hj , where

αij = exp(eij) / Σk exp(eik)   (softmax)

ei = f (yi−1 , h)
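
A small sketch of these attention weights in NumPy; the scoring function f is not specified on the slide, so a plain dot-product score is assumed here for illustration:

import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16                                   # sequence length, state size

h = rng.normal(size=(T, d))                    # encoder states h_1 .. h_T
y_prev = rng.normal(size=d)                    # previous decoder output/state

e = h @ y_prev                                 # illustrative score e_j = f(y_prev, h_j)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                           # softmax over the scores
z = alpha @ h                                  # context vector z = sum_j alpha_j h_j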

Attention Heatmap

Source: http://nlp.seas.harvard.edu/2018/04/03/attention.html

Transformers

Transformers were defined in the influential paper “Attention is all you need” by Vaswani et al. in 2017 [VSP+ 17].
Recap – RNN with Encoder/Decoder architecture:

ht+1 = σ(WE ht + QE xt+1 )
gt+1 = σ(WD gt + QD yt )
yt+1 = softmax(WY gt+1 )

Problem: All the (unbounded) information in the sentence / sequence has to be stored in the (bounded) internal state vector ht .

Transformers

Key concepts:
1. Idea: Use attention instead of internal state vector ht . This allows for
unbounded sequences and also enables parallelism.
2. Use embedding to store sequences as sets in embedding vector space.
3. Use positional embedding to keep track of otherwise lost order
information.
4. Use Query, Key, Value (Q, K , V ) concept - similar to information
retrieval systems.
5. Multi-Heads for selective, parallel attention
6. Masking to ensure causality
We will explore these concepts in detail below.

Attention

We make use of the well-known attention concept, but modify it a little:


We use the attention mechanism as described above.
In addition, we apply self-attention by computing the correlations of the tokens (words) of a sequence with each other.
To achieve this, we need to treat tokens as vectors by making use of an
embedding φ:
φ : {Words} −→ Rⁿ
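
In practice such an embedding φ is just a learned lookup table. A toy sketch (the vocabulary and dimension are made up; real embeddings are trained, not random):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]     # toy vocabulary (illustrative)
n = 8                                          # embedding dimension

E = rng.normal(size=(len(vocab), n))           # one row of R^n per word
index = {w: i for i, w in enumerate(vocab)}

def phi(word):
    """phi: Words -> R^n, realised as a table lookup."""
    return E[index[word]]

sentence = ["the", "cat", "sat"]
X = np.stack([phi(w) for w in sentence])       # token vectors, shape (3, n)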


Embedding

[Figure: attention (alignment) matrices between the words of an English and a French sentence, panels (a) and (b).]
Source: [BCB15]

Remark: Watch the reversed order (“la zone économique européenne” vs.
“the European economic area”)!
Positional Embedding

We want to keep track of the position (order) of elements. To this end, we define the following mapping
xi(t) := sin(ωk t) if i = 2k,  and  xi(t) := cos(ωk t) if i = 2k + 1,

where

ωk := 1 / 10000^(2k/d)
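
A small sketch of this sinusoidal positional encoding (sequence length and dimension are illustrative; d is assumed to be even):

import numpy as np

def positional_encoding(seq_len, d):
    """x_i(t) = sin(w_k t) for i = 2k, cos(w_k t) for i = 2k+1, with w_k = 1/10000^(2k/d)."""
    t = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len-1
    k = np.arange(d // 2)[None, :]
    w = 1.0 / (10000 ** (2 * k / d))                # frequencies w_k
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(w * t)                     # even dimensions
    pe[:, 1::2] = np.cos(w * t)                     # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d=64)          # added to the token embeddings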

Positional Embedding

Source: Transformer Architecture: The Positional Encoding, Amirhossein Kazemnejad

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

Query, Key, Value (Q, K , V ) concept
We introduce three weight matrices WQ , WK , and WV and reuse X three times:

Q := X WQ
K := X WK
V := X WV

Think of this as an information retrieval system.


Then the attention is computed as

attention(Q, K, V) := softmax( QKᵀ / √dk ) V.

Intuition: The dot-product QKᵀ computes the similarity between Q and K, based on the geometric interpretation cos(∠(a, b)) = ⟨a, b⟩ / (‖a‖ ‖b‖).
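
A minimal sketch of scaled dot-product self-attention following these formulas (sizes and random weights are illustrative assumptions):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, i.e. scaled dot-product attention."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity between queries and keys
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 32, 16                # illustrative sizes

X = rng.normal(size=(T, d_model))          # one row per token (embedding + position)
WQ, WK, WV = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
Z = attention(X @ WQ, X @ WK, X @ WV)      # self-attention: X is reused three times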

Multi Heads

Simple idea: To attend to different contexts, we just take multiple copies of the attention mechanism.
I.e. we call the following expression a head:

headi := attention(Qi , Ki , Vi ),
and
multi-head := concatenate(head1 , . . . , headn )
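
A sketch of multi-head self-attention along these lines (sizes are illustrative; note that the original Transformer additionally multiplies the concatenation by an output projection matrix W^O, which the slide omits):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 32, 4
d_k = d_model // n_heads                             # per-head dimension

X = rng.normal(size=(T, d_model))
heads = []
for _ in range(n_heads):                             # one (W_Q, W_K, W_V) triple per head
    WQ, WK, WV = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ WQ, X @ WK, X @ WV))  # head_i = attention(Q_i, K_i, V_i)

multi_head = np.concatenate(heads, axis=-1)          # concatenate(head_1, ..., head_n)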

Masking to Ensure Causality

Problem:
The decoder is identical to the encoder, except that it is not allowed
to “look ahead”, i.e.
We do not want to know the tokens from the future, but rather
predict them from the past.
Solution:
In the decoder, the self-attention layer is only allowed to attend to
earlier, “past” positions in the output sequence.
This is achieved by masking future positions (setting them to −∞) before the softmax is applied.
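
A sketch of this causal masking for a single attention head (sizes and data are illustrative assumptions):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 6, 16
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = "future"
scores[mask] = -np.inf                             # masked before the softmax ...
weights = softmax(scores, axis=-1)                 # ... so future positions get weight 0
out = weights @ V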

Transformer Architecture

Figure 1: The Transformer - model architecture.


Source: [VSP+ 17]
Remark: We usually stack many of the encoders and decoders together.
Transformers and Hopfield
In 2020 Ramsauer et al. (including Sepp Hochreiter) [RSL+ 21] proved
that the Transformer Update Rule

attention(Q, K, V) := softmax( QKᵀ / √dk ) V
is equivalent to the update rule for so-called “modern” Hopfield networks,
characterised by its Energy function (Free Energy)
E = −lse(β, Xᵀξ) + ½ ξᵀξ + β⁻¹ log N + ½ M²,

where lse denotes the log-sum-exp function, i.e. lse(β, x) := β⁻¹ log( Σi exp(β xi) ).

Thus, in the end we came full circle. . .
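
A small numerical check of this correspondence, assuming the modern-Hopfield update rule ξnew = X softmax(β Xᵀξ) from [RSL+ 21] with β = 1/√d (the pattern matrix and state below are made up):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, N = 16, 10
X = rng.normal(size=(d, N))                    # N stored patterns as columns
xi = rng.normal(size=d)                        # state / query pattern
beta = 1.0 / np.sqrt(d)

# Modern Hopfield update: xi_new = X softmax(beta * X^T xi)
xi_new = X @ softmax(beta * (X.T @ xi))

# The same expression read as attention: one query xi, keys/values = stored patterns
attn = softmax((X.T @ xi) / np.sqrt(d)) @ X.T
print(np.allclose(xi_new, attn))               # True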

References I

[BCB15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation


by jointly learning to align and translate,” in ICLR 2015, Jan.
2015, 3rd International Conference on Learning Representations,
ICLR 2015 ; Conference date: 07-05-2015 Through 09-05-2015.

[CvMG+ 14] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk,


and Y. Bengio, “Learning phrase representations using RNN
encoder-decoder for statistical machine translation,” CoRR, vol.
abs/1406.1078, 2014. [Online]. Available:
http://arxiv.org/abs/1406.1078

[DHKS20] N. Damodaran, E. Haruni, M. Kokhkharova, and J. Schäfer,


“Device free human activity and fall recognition using wifi channel
state information (CSI),” CCF Transactions on Pervasive
Computing and Interaction, vol. 2, pp. 1–17, January 2020.

References II

[Hop82] J. J. Hopfield, “Neural networks and physical systems with


emergent collective computational abilities,” Proceedings of the
National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558,
1982. [Online]. Available:
https://www.pnas.org/content/79/8/2554

[HS97] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”


Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning


representations by back-propagating errors,” Nature, vol. 323, no.
6088, pp. 533–536, 1986. [Online]. Available:
https://doi.org/10.1038/323533a0

References III
[RSL+ 21] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler,
L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff,
D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and
S. Hochreiter, “Hopfield networks is all you need,” 2021.

[VSP+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.


Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Proceedings of the 31st International Conference on Neural
Information Processing Systems, ser. NIPS’17. Red Hook, NY,
USA: Curran Associates Inc., 2017, pp. 6000–6010.

[XBK+ 15] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov,


R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image
caption generation with visual attention,” CoRR, vol.
abs/1502.03044, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03044

