
Issue: Standard RNNs have poor memory

▪ The transition matrix necessarily weakens the signal at every step
▪ We need a structure that can leave some dimensions unchanged over many steps
▪ This is the problem addressed by so-called Long Short-Term Memory RNNs (LSTMs)
Idea: Make "remembering" easy
▪ Define a more elaborate update mechanism for changing the internal state
▪ By default, LSTMs remember the information from the last step
▪ Items are overwritten only as an active choice
LSTM diagram

[Diagram: one LSTM unit. The cell state runs along the top ($c_{t-1} \to c_t$); the output runs along the bottom ($h_{t-1} \to h_t$) and also feeds into the next unit; the input $x_t$ enters from below.]
Figure: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM diagram

[Diagram: the cell state $c_{t-1} \to c_t$ gets updated in two stages: first decide what to "forget", then add in "new" information.]

[Diagram: the "forget" decision is driven by the previous output $h_{t-1}$ and the current input $x_t$; $[h_{t-1}, x_t]$ denotes the concatenation of the two vectors.]
LSTM diagram
Decide what to "forget":

$$f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f)$$

The forget gate is based on the previous output and the current input.
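A minimal NumPy sketch of the forget-gate computation may help; the sizes, random weights, and variable names below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and random data (assumptions, not from the slides).
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # forget-gate weights
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # previous output h_{t-1}
x_t = rng.normal(size=input_size)       # current input x_t

# f_t = sigma(W_f [h_{t-1}, x_t] + b_f): each entry lies in (0, 1) and says
# how much of the corresponding cell-state dimension to keep.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```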
LSTM diagram
Add in "new" information:

$$i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i)$$
$$C'_t = \tanh(W_C\,[h_{t-1}, x_t] + b_C)$$

The input gate $i_t$ decides how much to write, and $C'_t$ holds the candidate new content.
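The "new information" stage can be sketched the same way; again, the sizes and random data are made-up assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and random data (assumptions, not from the slides).
hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)

W_i = rng.normal(size=(hidden_size, hidden_size + input_size))  # input-gate weights
b_i = np.zeros(hidden_size)
W_C = rng.normal(size=(hidden_size, hidden_size + input_size))  # candidate weights
b_C = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]

i_t = sigmoid(W_i @ z + b_i)            # input gate: how much new information to write
C_cand = np.tanh(W_C @ z + b_C)         # candidate values C'_t, each in (-1, 1)
```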
LSTM diagram
Updating the cell state:

$$C_t = f_t * C_{t-1} + i_t * C'_t$$

Note: '$*$' denotes element-wise multiplication. The first term forgets the old state (or not); the second adds the new information (or not).
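A tiny numerical sketch of this update, with made-up gate values (assumptions) just to show the element-wise mechanics:

```python
import numpy as np

# C_t = f_t * C_{t-1} + i_t * C'_t, all operations element-wise.
C_prev = np.array([ 1.0, -2.0, 0.5,  3.0])   # old cell state C_{t-1}
f_t    = np.array([ 1.0,  0.0, 0.5,  1.0])   # forget gate: keep, drop, halve, keep
i_t    = np.array([ 0.0,  1.0, 0.0,  0.5])   # input gate: how much new content to add
C_cand = np.array([ 0.7, -0.3, 0.9, -1.0])   # candidate C'_t

C_t = f_t * C_prev + i_t * C_cand
print(C_t)   # -> [1.0, -0.3, 0.25, 2.5]
```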
LSTM diagram
Final stage computes the output:

$$o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$

Note: no weights here — the $\tanh$ is applied directly to the cell state.
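A corresponding sketch of the output stage, again with assumed sizes and random data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and random data (assumptions, not from the slides).
hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)

W_o = rng.normal(size=(hidden_size, hidden_size + input_size))  # output-gate weights
b_o = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)
C_t = rng.normal(size=hidden_size)      # updated cell state from the previous stage

o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)  # output gate
h_t = o_t * np.tanh(C_t)   # the tanh here has no weights of its own
```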
LSTM unrolled

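Putting the pieces together, here is a minimal sketch of one LSTM step unrolled over a toy sequence. The gate ordering inside the stacked weight matrix, the sizes, and the random data are all assumptions; real library implementations differ in these details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the slide equations.

    W maps the concatenation [h_{t-1}, x_t] to the four gates stacked in the
    order forget, input, candidate, output (an assumed ordering)."""
    z = np.concatenate([h_prev, x_t])
    H = h_prev.shape[0]
    gates = W @ z + b
    f_t = sigmoid(gates[0*H:1*H])        # forget gate
    i_t = sigmoid(gates[1*H:2*H])        # input gate
    C_cand = np.tanh(gates[2*H:3*H])     # candidate C'_t
    o_t = sigmoid(gates[3*H:4*H])        # output gate
    C_t = f_t * C_prev + i_t * C_cand    # forget the old, add the new
    h_t = o_t * np.tanh(C_t)             # output, which also feeds the next step
    return h_t, C_t

# Unroll the same unit over a toy sequence (sizes and data are assumptions).
hidden_size, input_size, T = 4, 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden_size, hidden_size + input_size)) * 0.1
b = np.zeros(4 * hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for t in range(T):
    x_t = rng.normal(size=input_size)
    h, C = lstm_step(x_t, h, C, W, b)    # the same W, b are reused at every step
```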
Final Points
▪ This is the most common version of the LSTM, but there are many different "flavors"
– Gated Recurrent Unit (GRU)
– Depth-Gated RNN
▪ LSTMs have considerably more parameters than plain RNNs (a rough count follows below)
▪ Most of the big performance improvements in NLP have come from LSTMs, not plain RNNs
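To make the parameter comparison concrete, here is a rough back-of-the-envelope count under the formulation used in these slides (one weight matrix per gate over the concatenated $[h_{t-1}, x_t]$, plus a bias). The sizes are made up, and exact counts vary between libraries.

```python
# Rough per-layer parameter counts (approximation; exact counts vary by library).
H, X = 512, 256   # hidden and input sizes (made-up example values)

per_gate    = H * (H + X) + H
rnn_params  = 1 * per_gate   # plain RNN: single transition
gru_params  = 3 * per_gate   # GRU: reset gate, update gate, candidate
lstm_params = 4 * per_gate   # LSTM: forget, input, candidate, output

print(rnn_params, gru_params, lstm_params)   # LSTM has roughly 4x the plain-RNN count
```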
