
Learning From Data

12: Neural Networks – III


Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main



Summer Semester 2022
Content

RNNs

LSTMs

Attention and Transformers

Bibliography

If training vanilla neural nets is optimization over functions,
training recurrent nets is optimization over programs.
– Andrej Karpathy

Recurrent Neural Networks (RNNs)

“Classical” neural network architectures only address non-temporal problems.
Ignoring the time dimension (or sequence) causes problems similar to ignoring the spatial dimension for images.

Problem Statement
We need an architecture to properly capture and deal with temporal,
sequential data!

RNNs

RNN architecture (unrolled):

[Diagram: one RNN cell with input x, hidden state h, and output y, unrolled over time into a chain x0 → h0 → y0, x1 → h1 → y1, . . . , xt → ht → yt, where each hidden state also feeds the next time step.]

Recurrence formula: ht = fW (ht−1 , xt ).

RNNs Stacking
As usual, we stack several layers:

[Diagram: several RNN layers stacked on top of each other, each unrolled over time; the inputs x0 . . . xt enter the bottom layer, each layer's hidden states feed both the next time step and the layer above, and the top layer produces the outputs y0 . . . yt.]

RNNs
Main ideas:
Process data sequentially
Introduce a hidden state vector ht to keep an internal memory
Define activation functions and architecture as follows:

ht = fW (ht−1 , xt ) e.g.
ht = tanh(Whh ht−1 + Wxh xt )
yt = Why ht or
yt = softmax(Why ht )

It can be proven that RNNs are Turing-complete, i.e. they can simulate arbitrary programs.
Based on ideas of David E. Rumelhart, Geoffrey E. Hinton, and Ronald J.
Williams [RHW86] and John Hopfield [Hop82].
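
A minimal sketch of this recurrence in Python/NumPy (the dimensions, initialisation, and toy data are illustrative assumptions, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 4          # illustrative sizes

W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hid))

def rnn_step(h_prev, x_t):
    """One application of h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t                   # or softmax(W_hy @ h_t) for classification
    return h_t, y_t

# Unroll over a toy sequence of length 5
h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h, y = rnn_step(h, x)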
RNN Use Cases

Sensor data classification / time series (many to many)


Language Translation (many to many)
Sentiment Classification (many to few or one)
Image Captioning (one to many)

RNN Example

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNN Example

Shakespeare Simulation
KING LEAR: O, if you were a feeble sight, the courtesy of
your law, Your sight and several breath, will wear the gods With
his heads, and my hands are wonder’d at the deeds, So drop
upon your lordship’s head, and your opinion Shall be against
your honour.
Source:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNN Training

Backpropagation:
1. Run forward through the entire sequence to compute the loss,
2. then backward through the entire sequence to compute the gradient (sketched below).
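
A sketch of this two-pass procedure (backpropagation through time) for the simple tanh RNN above, assuming a per-step squared-error loss; all sizes and data are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 8, 16, 4, 5   # illustrative sizes and sequence length

W_xh = rng.normal(scale=0.1, size=(d_hid, d_in))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hid))

xs = rng.normal(size=(T, d_in))
targets = rng.normal(size=(T, d_out))

# 1. Forward through the entire sequence, caching states, to compute the loss
hs = [np.zeros(d_hid)]
ys, loss = [], 0.0
for x, tgt in zip(xs, targets):
    h = np.tanh(W_hh @ hs[-1] + W_xh @ x)
    y = W_hy @ h
    hs.append(h); ys.append(y)
    loss += 0.5 * np.sum((y - tgt) ** 2)

# 2. Backward through the entire sequence (backpropagation through time)
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
dh_next = np.zeros(d_hid)
for t in reversed(range(T)):
    dy = ys[t] - targets[t]
    dW_hy += np.outer(dy, hs[t + 1])
    dh = W_hy.T @ dy + dh_next            # gradient reaching h_t
    dpre = (1.0 - hs[t + 1] ** 2) * dh    # through the tanh nonlinearity
    dW_xh += np.outer(dpre, xs[t])
    dW_hh += np.outer(dpre, hs[t])
    dh_next = W_hh.T @ dpre               # passed on to the previous time step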

RNN Training
[Diagram: the stacked, unrolled RNN from before, annotated with the loss; the forward pass runs left to right through the whole sequence x0 . . . xt to compute the loss, and the backward pass runs right to left through the whole sequence to compute the gradients.]

RNN Training Problem
Vanishing / Exploding Gradient:

ht = tanh(Whh ht−1 + Wxh xt)
   = tanh( [Whh Wxh] (ht−1 ; xt) )
   = tanh( W (ht−1 ; xt) ),

where W := [Whh Wxh] and (ht−1 ; xt) denotes the stacked vector.

Thus, if we repeat (suppressing the inputs xt for readability), we get

ht = tanh( W tanh( W (ht−2 ; xt−1) ) )

and so on.
RNN Training Problem

Therefore, we have
ht = tanh( W tanh( W ( . . . tanh( W (h0 ; x1) ) . . . ) ) )

The many, repeated multiplications with W will blow up the gradient or reduce it to zero!
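
A tiny numerical illustration of this effect (matrix size, scales, and number of steps are arbitrary assumptions; the tanh derivative, which only shrinks gradients further, is omitted):

import numpy as np

rng = np.random.default_rng(0)
d = 16

for scale in (0.05, 0.5):                      # small vs. large recurrent weights
    W = rng.normal(scale=scale, size=(d, d))
    grad = np.eye(d)                           # stand-in for the gradient at step t
    for _ in range(50):                        # propagate back over 50 time steps
        grad = grad @ W                        # repeated multiplication with W
    print(scale, np.linalg.norm(grad))         # tends to vanish for small W, explode for large W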

Problem Statement
We need an architecture to properly take care of the vanishing/exploding
gradient problem!

Long Short Term Memory (LSTMs)
RNN Architecture:

ht = tanh( W (ht−1 ; xt) )

LSTM Architecture:

( it ; ft ; ot ; gt ) = ( σ ; σ ; σ ; tanh )( W (ht−1 ; xt) )
ct = ft ⊙ ct−1 + it ⊙ gt
ht = ot ⊙ tanh(ct)

Here each nonlinearity acts on its block of the pre-activation, and ⊙ denotes the element-wise (Hadamard) product.
For LSTMs we introduce four gates
1. i: Input gate - whether to write to cell
2. f: Forget gate - whether to erase cell
3. o: Output gate - how much to output
4. g: Gate “the” gate - content
and the memory cell ct and state ht .
Introduced by Hochreiter and Schmidhuber in 1997, see [HS97].
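
A minimal sketch of one LSTM step following the gate equations above (sizes, initialisation, and the single stacked weight matrix are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                       # illustrative sizes

# One big weight matrix producing the four gate pre-activations at once
W = rng.normal(scale=0.1, size=(4 * d_hid, d_hid + d_in))

def lstm_step(h_prev, c_prev, x_t):
    z = W @ np.concatenate([h_prev, x_t])
    i, f, o = (sigmoid(z[k * d_hid:(k + 1) * d_hid]) for k in range(3))
    g = np.tanh(z[3 * d_hid:])
    c_t = f * c_prev + i * g              # element-wise cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h = c = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(h, c, x)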
LSTM Architecture

[Diagram: a chain of LSTM cells; each cell receives xt, ht−1, and ct−1, computes the three σ gates and the tanh candidate, updates the cell state via element-wise multiplication and addition, and passes ht and ct on to the next cell.]

LSTM Backpropagation

LSTMs avoid the backpropagation problem because the cell state ct is only updated element-wise; no repeated matrix multiplication with W occurs along the cell-state path.
For details, see [HS97].

Takeaway
Think of LSTMs as RNNs on steroids that avoid the vanishing / exploding gradient problem.

LSTM Example
Human Activity Recognition (HAR) with Channel State Information (CSI)
from Radio [DHKS20]:

[Figure: CSI values for (a) Sitting Down and (b) Fall]


Gated Recurrent Units (GRU)

GRU Architecture

zt = σ(Wz xt + Uz ht−1)
rt = σ(Wr xt + Ur ht−1)
h̃t = φ(Wh xt + Uh (rt ⊙ ht−1))
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

Think of a GRU as a simplified LSTM, as it lacks an output gate. GRUs were introduced by Cho et al. [CvMG+ 14].
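
A minimal sketch of one GRU step following the equations above (sizes and initialisation are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                                    # illustrative sizes

Wz, Uz = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wr, Ur = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wh, Uh = rng.normal(scale=0.1, size=(d_hid, d_in)), rng.normal(scale=0.1, size=(d_hid, d_hid))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)                # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)                # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))    # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(h, x)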


Focusing Learning by Attention


In 2015, Xu et al. applied the idea of attention (which was in the air at the time anyway) to RNNs for image captioning [XBK+ 15]:

[Figure 1 from the paper: the model learns a word/image alignment; the visualized attentional maps are explained in sections 3.1 and 5.4 of [XBK+ 15].]

Source: [XBK+ 15]


Attention Example

Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word)

Source: [XBK+ 15]
Attention Failure

Figure 5. Examples of mistakes where we can use attention to gain intuition into what the model saw.

Source: [XBK+ 15]


Attention Principle

[Diagram comparing connectivity patterns: Fully Connected, Convolution, Local Attention, Global Attention.]

Attention Principle

Before attention and transformers, Sequence-to-Sequence (Seq2Seq) architectures using RNNs were used:

x = {x1 , . . . , xt } → Encoder → Decoder → y = {y1 , . . . , yt′ }

Problem
While this works well for short sequences (t < 20), it fails for long
sequences.

Attention Principle

We want to weight the sensitivities of the network to the input based on memory from previous inputs.
This we call “attention”.
Mathematically,

zi = Σj αij hj , where

αij = exp(eij) / Σk exp(eik)   (softmax)

ei = f (yi−1 , h)
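
A small sketch of these attention weights in NumPy; the scoring function f is not specified on the slide, so a plain dot-product score is assumed here for illustration:

import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16                                   # sequence length, state size

h = rng.normal(size=(T, d))                    # encoder states h_1 .. h_T
y_prev = rng.normal(size=d)                    # previous decoder output/state

e = h @ y_prev                                 # illustrative score e_j = f(y_prev, h_j)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                           # softmax over the scores
z = alpha @ h                                  # context vector z = sum_j alpha_j h_j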

Attention Heatmap

Source: http://nlp.seas.harvard.edu/2018/04/03/attention.html

Transformers

Transformers were defined in the influential paper “Attention is all you need” by Vaswani et al. in 2017 [VSP+ 17].
Recap – RNN with Encoder/Decoder architecture:

ht+1 = σ(WE ht + QE xt+1 )
gt+1 = σ(WD gt + QD yt )
yt+1 = softmax(WY gt+1 )

Problem: All the (unbounded) information in the sentence / sequence has to be stored in the (bounded) internal state vector ht .

Transformers

Key concepts:
1. Idea: Use attention instead of internal state vector ht . This allows for
unbounded sequences and also enables parallelism.
2. Use embedding to store sequences as sets in embedding vector space.
3. Use positional embedding to keep track of otherwise lost order
information.
4. Use Query, Key, Value (Q, K , V ) concept - similar to information
retrieval systems.
5. Multi-Heads for selective, parallel attention
6. Masking to ensure causality
We will explore these concepts in detail below.

Attention

We make use of the well-known attention concept, but modify it a little:


We use the attention mechanism as described above.
In addition, we apply self-attention by computing the correlations of the tokens (words) of a sequence with each other.
To achieve this, we need to treat tokens as vectors by making use of an
embedding φ:
φ : {Words} −→ Rⁿ
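
In practice such an embedding φ is just a learned lookup table. A toy sketch (the vocabulary and dimension are made up; real embeddings are trained, not random):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]     # toy vocabulary (illustrative)
n = 8                                          # embedding dimension

E = rng.normal(size=(len(vocab), n))           # one row of R^n per word
index = {w: i for i, w in enumerate(vocab)}

def phi(word):
    """phi: Words -> R^n, realised as a table lookup."""
    return E[index[word]]

sentence = ["the", "cat", "sat"]
X = np.stack([phi(w) for w in sentence])       # token vectors, shape (3, n)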


Embedding

[Figure: attention (alignment) matrices between the words of an English and a French sentence, panels (a) and (b).]
Source: [BCB15]

Remark: Watch the reversed order (“la zone économique européenne” vs.
“the European economic area”)!
Positional Embedding

We want to keep track of the position (order) of elements. To this end, we define the following mapping
xi(t) := sin(ωk t) if i = 2k,  and  xi(t) := cos(ωk t) if i = 2k + 1,

where

ωk := 1 / 10000^(2k/d)
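
A small sketch of this sinusoidal positional encoding (sequence length and dimension are illustrative; d is assumed to be even):

import numpy as np

def positional_encoding(seq_len, d):
    """x_i(t) = sin(w_k t) for i = 2k, cos(w_k t) for i = 2k+1, with w_k = 1/10000^(2k/d)."""
    t = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len-1
    k = np.arange(d // 2)[None, :]
    w = 1.0 / (10000 ** (2 * k / d))                # frequencies w_k
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(w * t)                     # even dimensions
    pe[:, 1::2] = np.cos(w * t)                     # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d=64)          # added to the token embeddings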

Positional Embedding

Source: Transformer Architecture: The Positional Encoding, Amirhossein Kazemnejad

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

Query, Key, Value (Q, K , V ) concept
We introduce three weight matrices WQ , WK , and WV and reuse X three times:

Q := X WQ
K := X WK
V := X WV

Think of this as an information retrieval system.


Then the attention is computed as

attention(Q, K, V) := softmax( QKᵀ / √dk ) V.

Intuition: The dot-product QKᵀ computes the similarity between Q and K, based on the geometric interpretation cos(∠(a, b)) = ⟨a, b⟩ / (‖a‖ ‖b‖).
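
A minimal sketch of scaled dot-product self-attention following these formulas (sizes and random weights are illustrative assumptions):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, i.e. scaled dot-product attention."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity between queries and keys
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 32, 16                # illustrative sizes

X = rng.normal(size=(T, d_model))          # one row per token (embedding + position)
WQ, WK, WV = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
Z = attention(X @ WQ, X @ WK, X @ WV)      # self-attention: X is reused three times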

Multi Heads

Simple idea: To attend to different contexts, we just take multiple copies of the attention mechanism.
I.e. we call the following expression a head:

headi := attention(Qi , Ki , Vi ),
and
multi-head := concatenate(head1 , . . . , headn )
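
A sketch of multi-head self-attention along these lines (sizes are illustrative; note that the original Transformer additionally multiplies the concatenation by an output projection matrix W^O, which the slide omits):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 32, 4
d_k = d_model // n_heads                             # per-head dimension

X = rng.normal(size=(T, d_model))
heads = []
for _ in range(n_heads):                             # one (W_Q, W_K, W_V) triple per head
    WQ, WK, WV = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ WQ, X @ WK, X @ WV))  # head_i = attention(Q_i, K_i, V_i)

multi_head = np.concatenate(heads, axis=-1)          # concatenate(head_1, ..., head_n)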

Masking to Ensure Causality

Problem:
The decoder is identical to the encoder, except that it is not allowed
to “look ahead”, i.e.
We do not want to know the tokens from the future, but rather
predict them from the past.
Solution:
In the decoder, the self-attention layer is only allowed to attend to
earlier, “past” positions in the output sequence.
This is achieved by masking future positions (setting them to −∞) before the softmax is applied.
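
A sketch of this causal masking for a single attention head (sizes and data are illustrative assumptions):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_k = 6, 16
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal = "future"
scores[mask] = -np.inf                             # masked before the softmax ...
weights = softmax(scores, axis=-1)                 # ... so future positions get weight 0
out = weights @ V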

Transformer Architecture

Figure 1: The Transformer - model architecture.


Source: [VSP+ 17]
Remark: We usually stack many of the encoders and decoders together.
Transformers and Hopfield
In 2020 Ramsauer et al. (including Sepp Hochreiter) [RSL+ 21] proved
that the Transformer Update Rule

attention(Q, K, V) := softmax( QKᵀ / √dk ) V
is equivalent to the update rule for so-called “modern” Hopfield networks,
characterised by its Energy function (Free Energy)
E = −lse(β, Xᵀξ) + ½ ξᵀξ + β⁻¹ log N + ½ M²,

where lse denotes the log-sum-exp function, i.e. lse(β, x) := β⁻¹ log( Σi exp(β xi) ).

Thus, in the end we came full circle. . .
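
A small numerical check of this correspondence, assuming the modern-Hopfield update rule ξnew = X softmax(β Xᵀξ) from [RSL+ 21] with β = 1/√d (the pattern matrix and state below are made up):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, N = 16, 10
X = rng.normal(size=(d, N))                    # N stored patterns as columns
xi = rng.normal(size=d)                        # state / query pattern
beta = 1.0 / np.sqrt(d)

# Modern Hopfield update: xi_new = X softmax(beta * X^T xi)
xi_new = X @ softmax(beta * (X.T @ xi))

# The same expression read as attention: one query xi, keys/values = stored patterns
attn = softmax((X.T @ xi) / np.sqrt(d)) @ X.T
print(np.allclose(xi_new, attn))               # True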

References I

[BCB15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation


by jointly learning to align and translate,” in ICLR 2015, Jan.
2015, 3rd International Conference on Learning Representations,
ICLR 2015 ; Conference date: 07-05-2015 Through 09-05-2015.

[CvMG+ 14] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk,


and Y. Bengio, “Learning phrase representations using RNN
encoder-decoder for statistical machine translation,” CoRR, vol.
abs/1406.1078, 2014. [Online]. Available:
http://arxiv.org/abs/1406.1078

[DHKS20] N. Damodaran, E. Haruni, M. Kokhkharova, and J. Schäfer,


“Device free human activity and fall recognition using wifi channel
state information (CSI),” CCF Transactions on Pervasive
Computing and Interaction, vol. 2, pp. 1–17, January 2020.

References II

[Hop82] J. J. Hopfield, “Neural networks and physical systems with


emergent collective computational abilities,” Proceedings of the
National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558,
1982. [Online]. Available:
https://www.pnas.org/content/79/8/2554

[HS97] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”


Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning


representations by back-propagating errors,” Nature, vol. 323, no.
6088, pp. 533–536, 1986. [Online]. Available:
https://doi.org/10.1038/323533a0

References III
[RSL+ 21] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler,
L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff,
D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and
S. Hochreiter, “Hopfield networks is all you need,” 2021.

[VSP+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.


Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in
Proceedings of the 31st International Conference on Neural
Information Processing Systems, ser. NIPS’17. Red Hook, NY,
USA: Curran Associates Inc., 2017, pp. 6000–6010.

[XBK+ 15] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov,


R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image
caption generation with visual attention,” CoRR, vol.
abs/1502.03044, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03044

