
Long Short-Term Memory Networks

Applications

Imran Siddiqi
Bahria University, Islamabad
imran.siddiqi@bahria.edu.pk

DSC 704 Deep Learning Bahria University, Spring 2019 1


Recap

DSC 704 Deep Learning Bahria University, Spring 2019 2


Recap

DSC 704 Deep Learning Bahria University, Spring 2019 3


LSTMs for Sequence Classification

Many-to-one mapping

DSC 704 Deep Learning Bahria University, Spring 2019 4


Recall – Sentiment Classification with RNN

DSC 704 Deep Learning Bahria University, Spring 2019 5


Sentiment Classification with LSTM
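The model for this slide is not reproduced in the text, so here is a minimal Keras sketch of the idea: an Embedding layer feeds an LSTM whose final state drives a single sigmoid output (many-to-one). The IMDB dataset, layer sizes and epoch count are illustrative assumptions, not the lecture's exact setup.

```python
# Minimal sketch (illustrative, not the lecture's exact code): LSTM sentiment classifier.
# Assumes integer-encoded reviews such as the built-in IMDB dataset.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, max_len = 10000, 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)   # pad/truncate reviews to a fixed length
x_test = pad_sequences(x_test, maxlen=max_len)

model = Sequential([
    Embedding(vocab_size, 64, input_length=max_len),  # word index -> dense vector
    LSTM(64),                                         # many-to-one: keep only the final state
    Dense(1, activation='sigmoid'),                   # positive vs. negative
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```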

DSC 704 Deep Learning Bahria University, Spring 2019 6


LSTM for Sequence Classification on Images

Feed the columns/rows of the image one by one as a sequence (of length 28)

DSC 704 Deep Learning Bahria University, Spring 2019 7


LSTM for Sequence Classification on Images

Feed each row of the image as a sequence of 28 values


28 rows = 28 time steps
Each row has 28 values = feature vector size
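A minimal Keras sketch of this setup, assuming MNIST: each image is read as 28 time steps of 28-value rows and classified with a single LSTM layer. The layer width and epoch count are illustrative, not the slide's exact values.

```python
# Minimal sketch: each 28x28 MNIST image is a sequence of 28 rows (time steps),
# each row a 28-value feature vector, classified many-to-one by an LSTM.
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0          # shape (60000, 28, 28)
x_test = x_test.astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    LSTM(128, input_shape=(28, 28)),   # 28 time steps, 28 features per step
    Dense(10, activation='softmax'),   # one output per digit class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)

# For the 2-rows-per-step variant on the next slide, reshape to 14 steps of 56 features:
# x_train.reshape(-1, 14, 56) with LSTM(128, input_shape=(14, 56))
```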

DSC 704 Deep Learning Bahria University, Spring 2019 8


LSTM for Sequence Classification on Images
• Can feed multiple rows/columns per time step – for instance, a sequence of 2 rows at a time
• 2 rows = 2 x 28 = 56 values (features) per step
• This gives 14 time steps

DSC 704 Deep Learning Bahria University, Spring 2019 9


LSTM for Sequence Classification on Images
• Can build a deeper network by stacking LSTM layers

DSC 704 Deep Learning Bahria University, Spring 2019 10


C-RNN
• Why not use machine-learned features rather than raw pixel values as
input to the LSTM?
• Feed the image to a ConvNet
• Convert the output feature maps to sequences
• Feed these sequences to an LSTM

DSC 704 Deep Learning Bahria University, Spring 2019 11


C-RNN – Classification on CIFAR Dataset
• Load and Prepare Data

• Create Model

DSC 704 Deep Learning Bahria University, Spring 2019 12


C-RNN – Classification on CIFAR Dataset
• Train and Evaluate the model
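The code screenshots for these two slides are not in the extracted text; the following is a hedged end-to-end sketch of a C-RNN on CIFAR-10 (load and prepare data, create the model, train and evaluate). Layer sizes, epochs and batch size are illustrative assumptions, not the lecture's exact values.

```python
# Sketch of a C-RNN for CIFAR-10: ConvNet feature maps are reshaped into a sequence
# and read by an LSTM. Hyperparameters are illustrative.
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Reshape, LSTM, Dense

# Load and prepare data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# Create model: Conv front-end, then feature-map rows as a sequence for the LSTM
model = Sequential([
    Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),                 # feature maps: 8 x 8 x 64
    Reshape((8, 8 * 64)),                 # 8 "time steps", each a 512-value feature vector
    LSTM(128),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train and evaluate the model
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
loss, acc = model.evaluate(x_test, y_test)
print('test accuracy:', acc)
```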

DSC 704 Deep Learning Bahria University, Spring 2019 13


Sequence to Sequence Mapping

Many-to-many mapping

DSC 704 Deep Learning Bahria University, Spring 2019 14


Sequence to Sequence Mapping
• Machine Translation
• Speech Recognition
• Text/Handwriting Recognition

DSC 704 Deep Learning Bahria University, Spring 2019 15


Machine Translation

DSC 704 Deep Learning Bahria University, Spring 2019 16


Machine Translation

DSC 704 Deep Learning Bahria University, Spring 2019 17


Machine Translation

• Encoder Network:
• RNN/LSTM
• Feed one word at a time
• RNN outputs a vector that
represents the sentence

DSC 704 Deep Learning Bahria University, Spring 2019 18


Machine Translation
• Decoder Network:
• Takes output of encoder as input
• Can be trained to output the translation in English – one word at a time
• Given sufficient pairs of French and English sentences, the model works fairly well

DSC 704 Deep Learning Bahria University, Spring 2019 19


Machine Translation

DSC 704 Deep Learning Bahria University, Spring 2019 20


Smart Reply / Chatbots
• Instead of translations, train with possible answers

DSC 704 Deep Learning Bahria University, Spring 2019 21


Image Captioning
• An architecture very similar to the one used for machine translation also works for image
captioning

DSC 704 Deep Learning Bahria University, Spring 2019 22


Image Captioning
• Using a CNN (for example AlexNet), we can learn an encoding (feature
representation) of a given image

DSC 704 Deep Learning Bahria University, Spring 2019 23


Image Captioning

DSC 704 Deep Learning Bahria University, Spring 2019 24


Image Captioning

Drop the softmax layer and feed the encoding to an RNN, which predicts the caption one word at a time
DSC 704 Deep Learning Bahria University, Spring 2019 25
Machine Translation
• There can be many possible
translations
• How to pick?
• Greedy Search:
• Generate one word at a time: pick the most likely first word, then the most likely next word, and so on

DSC 704 Deep Learning Bahria University, Spring 2019 26


Machine Translation – Greedy Search

In making predictions – pick the most likely word at each time-step

DSC 704 Deep Learning Bahria University, Spring 2019 27


Machine Translation – Greedy Search

DSC 704 Deep Learning Bahria University, Spring 2019 28


Machine Translation – Greedy Search
• In each step we only consider the most likely output, but this does not always guarantee the best solution

The probability of the word "going" after "is" is larger than that of the word "visiting"

DSC 704 Deep Learning Bahria University, Spring 2019 29


Machine Translation – Greedy Search

DSC 704 Deep Learning Bahria University, Spring 2019 30


Machine Translation – Beam Search
• Considere multiple options in every step , not just one option
• These number of options is controlled by a variable called Beam
Width

DSC 704 Deep Learning Bahria University, Spring 2019 31


DSC 704 Deep Learning Bahria University, Spring 2019 32
Machine Translation – Beam Search

Step 1:
Define the Beam Width (e.g. B = 3) and, instead of the single most probable output, pick the three most probable outputs for the first word

DSC 704 Deep Learning Bahria University, Spring 2019 33


Machine Translation – Beam Search
Step 2:
For each of the three candidate first words, consider what the next word should be, scoring word pairs with

P(y<1>, y<2> | x) = P(y<1> | x) . P(y<2> | x, y<1>)

DSC 704 Deep Learning Bahria University, Spring 2019 36


Machine Translation – Beam Search
If the first word is "in", there are 10,000 choices for the second word – the same holds for "jane" and "september".
There will be 30,000 possible values of P(y<1>, y<2> | x).
Beam width = 3; pick the three most likely choices:
• in-september
• jane-in
• jane-visits

If beam search makes these three choices, it is rejecting "september" as the first word

DSC 704 Deep Learning Bahria University, Spring 2019 37


Machine Translation – Beam Search
For each of the three retained word pairs, evaluate the probability of the next word,

P(y<3> | x, y<1>, y<2>)

and again keep the B most likely partial translations.

DSC 704 Deep Learning Bahria University, Spring 2019 38


Machine Translation – Beam Search
Beam Search can generate multiple
samples, and the best of those are often
better than samples created using greedy
search.

DSC 704 Deep Learning Bahria University, Spring 2019 39


Beam Search – Worked Example (Beam Width = 2)

Per-time-step word probabilities:

Time-step    The     cat     is      red
T=0          0.4     0.0     0.3     0.0
T=1          0.1     0.2     0.0     0.0
T=2          0.0     0.0     0.8     0.1
T=3          0.2     0.0     0.0     0.8

T=0: keep the two most likely first words: "The" (0.4) and "is" (0.3).

T=1: extend each kept hypothesis with every word and multiply the probabilities:
The-The = 0.04, The-cat = 0.08, is-cat = 0.06, is-The = 0.03
Keep the two best: The-cat (0.08) and is-cat (0.06).

T=2: extend again:
The-cat-is = 0.064, The-cat-red = 0.008, is-cat-is = 0.048, is-cat-red = 0.006
Keep the two best: The-cat-is (0.064) and is-cat-is (0.048).

Tree representation: starting from the two first-word candidates (The: 0.4, is: 0.3), each branch carries the product of the probabilities along its path (0.04 and 0.08 under "The"; 0.03 and 0.06 under "is"), and only the two best branches survive at each level.

DSC 704 Deep Learning Bahria University, Spring 2019 50
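A small pure-Python sketch of the search above. The per-step probabilities are fixed here for illustration; a real seq2seq decoder would recompute the next-word distribution conditioned on the input sentence and the partial hypothesis at every step.

```python
# Beam search over a fixed table of per-step word probabilities (illustrative only).
vocab = ['The', 'cat', 'is', 'red']
probs = [                     # rows: time steps, columns: vocab
    [0.4, 0.0, 0.3, 0.0],     # T=0
    [0.1, 0.2, 0.0, 0.0],     # T=1
    [0.0, 0.0, 0.8, 0.1],     # T=2
]

def beam_search(probs, vocab, beam_width=2):
    beams = [([], 1.0)]                                   # (partial sentence, score)
    for step in probs:
        candidates = [(seq + [w], score * p)
                      for seq, score in beams
                      for w, p in zip(vocab, step)]
        # keep only the beam_width highest-scoring partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search(probs, vocab):
    print('-'.join(seq), round(score, 3))   # The-cat-is 0.064, is-cat-is 0.048
```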


Encoder-Decoder LSTM in Keras
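The code for this slide is not reproduced in the text; below is a minimal sketch of the standard encoder-decoder LSTM pattern in the Keras functional API. The token counts and latent dimension are placeholders, and the model is trained with teacher forcing (the decoder input is the target sequence shifted by one step).

```python
# Minimal encoder-decoder (seq2seq) LSTM sketch; dimensions are placeholders.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256

# Encoder: read the source sequence, keep only its final states as the sentence encoding
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: generate the target sequence, initialised with the encoder states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()
```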

DSC 704 Deep Learning Bahria University, Spring 2019 51


Machine Translation - Implementation

DSC 704 Deep Learning Bahria University, Spring 2019 52


Machine Translation - Implementation
• Training Samples

DSC 704 Deep Learning Bahria University, Spring 2019 53


Machine Translation – Sample Output

DSC 704 Deep Learning Bahria University, Spring 2019 54


Generative Model: Text Generation

DSC 704 Deep Learning Bahria University, Spring 2019 55


Text Generation using RNNs/LSTMs
• What is the next word of the following sentence:
• "The man is walking down ________" ?
• Given a dictionary containing all potential words, the neural network takes
the sequence of words as seed input: 1: "the", 2: "man", 3: "is", …
• Its output is a matrix giving, for each word in the dictionary, the probability
of being the next word of the given sequence.
• Based on the training data, it might guess that the next word will be
"the"…
• Generating text?
• Simply iterate the process: once the next word is drawn from the dictionary, we
append it to the sequence and then predict a new word for this extended sequence
DSC 704 Deep Learning Bahria University, Spring 2019 56


Text Generation with RNNs/LSTMs
• Word Level

DSC 704 Deep Learning Bahria University, Spring 2019 57


Text Generation using RNNs/LSTMs
• Character Level – Learn Dependencies between characters
• Source Text: Alice in Wonderland
• Load the ASCII text for the book into memory and convert all of the
characters to lowercase to reduce the vocabulary that the network
must learn.

Example Credit: Text Generation With LSTM Recurrent Neural Networks in Python with Keras, by Jason Brownlee

DSC 704 Deep Learning Bahria University, Spring 2019 58


• Convert the characters to integers by first creating a set of all of the distinct characters in the book, then
creating a map of each character to a unique integer.

• Split the book text up into subsequences with a fixed length of 100 characters, an arbitrary
length.
• Each training pattern of the network is comprised of 100 time steps of one character (X)
followed by one character output (y).
• When creating these sequences, we slide this window along the whole book one character
at a time, allowing each character a chance to be learned from the 100 characters that
preceded it (except the first 100 characters)
• For example, if the sequence length is 5 (for simplicity), the first training pattern is
characters 1-5 predicting character 6, the second is characters 2-6 predicting character 7, and so on.
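A sketch of the windowing step from the referenced tutorial, assuming raw_text (the lowercased book text) and char_to_int (the character-to-integer map) were built in the previous steps.

```python
# Slide a 100-character window over the text: X = 100 characters, y = the next one.
seq_length = 100
dataX, dataY = [], []
for i in range(0, len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]          # 100 input characters
    seq_out = raw_text[i + seq_length]           # the character to predict
    dataX.append([char_to_int[ch] for ch in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print('Total patterns:', n_patterns)
```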

DSC 704 Deep Learning Bahria University, Spring 2019 59


• Prepare the data – Under 150,000 training patterns

• Normalize the input and convert the output to a one-hot encoding
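A sketch of this preparation step, assuming dataX, dataY, n_patterns, seq_length and n_vocab from the windowing step above.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# [samples, time steps, features], scaled to [0, 1]; targets become one-hot vectors
X = np.reshape(dataX, (n_patterns, seq_length, 1)).astype('float32') / float(n_vocab)
y = to_categorical(dataY)
```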

DSC 704 Deep Learning Bahria University, Spring 2019 60


• We define a single hidden LSTM layer with 256 memory units. The network uses
dropout with a probability of 0.2 (20%).
• The output layer is a Dense layer using the softmax activation function to output
a probability (between 0 and 1) for each of the 47 characters.

• Train the model

• There is no test dataset. We are modeling the entire training dataset to learn the probability of each
character in a sequence.
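A sketch of this model and its training loop, assuming X and y from the previous step; the checkpoint filename pattern is illustrative, not the lecture's exact code.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

model = Sequential([
    LSTM(256, input_shape=(X.shape[1], X.shape[2])),  # 100 time steps, 1 feature per step
    Dropout(0.2),                                     # 20% dropout
    Dense(y.shape[1], activation='softmax'),          # one probability per character
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# keep the weights with the lowest training loss (there is no test set)
checkpoint = ModelCheckpoint('weights-{epoch:02d}-{loss:.4f}.hdf5',
                             monitor='loss', save_best_only=True, mode='min')
model.fit(X, y, epochs=20, batch_size=128, callbacks=[checkpoint])
```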

DSC 704 Deep Learning Bahria University, Spring 2019 61


• Text Generation: The simplest way to use the Keras LSTM model to make predictions is to start with a
seed sequence as input, generate the next character, then update the seed sequence by adding the generated
character to the end and trimming off the first character.
• This process is repeated for as long as we want to predict new characters (e.g. a sequence of 1,000
characters in length).
• We can pick a random input pattern as our seed sequence, then print generated characters as we generate
them.
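A sketch of that generation loop, assuming model, dataX, n_vocab and int_to_char from the previous steps.

```python
import numpy as np

start = np.random.randint(0, len(dataX) - 1)
pattern = list(dataX[start])                       # random seed pattern (integer-encoded)
for _ in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))             # most likely next character
    print(int_to_char[index], end='')
    pattern.append(index)                          # append it to the seed ...
    pattern = pattern[1:]                          # ... and trim off the first character
```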

DSC 704 Deep Learning Bahria University, Spring 2019 62


Sample training text from the book

Seed Text

Generated Text

DSC 704 Deep Learning Bahria University, Spring 2019 63


Observations
• The generated text generally conforms to the line format observed in
the original text of less than 80 characters before a new line.
• The characters are separated into word-like groups; most groups are actual
English words (e.g. "the", "little" and "was"), but many are not (e.g. "lott",
"tiie" and "taede").
• Some of the word sequences make sense (e.g. "and the white rabbit"), but
many do not (e.g. "wese tilel").

DSC 704 Deep Learning Bahria University, Spring 2019 64


Deeper Network
• Much better – though naturally, as a whole, the text still does not make much sense

DSC 704 Deep Learning Bahria University, Spring 2019 65


Sequence Modeling with CTC

DSC 704 Deep Learning Bahria University, Spring 2019 66


Sequence Modeling with CTC
• Audio clips and transcriptions
• How do the characters in the transcript align with the audio?
• Solutions:
• Rules: “One character corresponds to ten inputs”.
• People’s rates of speech vary, so this type of rule can always be broken.
• Hand-align each character to its location in the audio
• Prohibitively time consuming

DSC 704 Deep Learning Bahria University, Spring 2019 67


Sequence Modeling with CTC
• Handwriting Recognition
• We could create a data-set with images of text-lines, and then specify for
each horizontal position of the image the corresponding character
• Then, we could train a NN to output a character-score for each horizontal
position.
• Not practical
• Post-recognition processing also fails:
• A single character can span multiple horizontal positions, e.g. we could get
"ttooo" because the "o" is a wide character
• We would then have to remove all duplicate "t"s and "o"s
• But what if the correct text had been "too"? Then removing all duplicate
"o"s gives the wrong result

DSC 704 Deep Learning Bahria University, Spring 2019 68


Connectionist Temporal Classification (CTC)
• Connectionist Temporal Classification (CTC) is a way to get around not
knowing the alignment between the input and the output.
• It’s especially well suited to applications like speech and handwriting
recognition.

DSC 704 Deep Learning Bahria University, Spring 2019 69


Handwriting Recognition with CTC

DSC 704 Deep Learning Bahria University, Spring 2019 70


Handwriting Recognition with CTC

DSC 704 Deep Learning Bahria University, Spring 2019 71


Handwriting Recognition with CTC
Feed the first vector
(t=0) to LSTM

Alphabet = {"a", "e", "l", "p", "z", "-"}

"-" is a special symbol (blank) that we always add to the alphabet
(explained later)

DSC 704 Deep Learning Bahria University, Spring 2019 72


Handwriting Recognition with CTC
Feed the subsequent
vectors

DSC 704 Deep Learning Bahria University, Spring 2019 73


Handwriting Recognition with CTC

CTC

DSC 704 Deep Learning Bahria University, Spring 2019 74


Best Path Decoding

CTC Decoding

ap-pl-ee
Remove repeated symbols

ap-pl-e
Remove blanks

apple

DSC 704 Deep Learning Bahria University, Spring 2019 75


Encoding the Text
• Introducing a pseudo-character (called blank, not to be confused with a
"real" blank, i.e. a white-space character).
• This special character will be denoted as “-” in the following example.
• We use a coding scheme to solve the duplicate-character problem -
When encoding a text:
• We can insert arbitrarily many blanks at any position, which will be removed
when decoding it.
• We must insert a blank between duplicate characters like in “apple”.
• We can repeat each character as often as we like.

DSC 704 Deep Learning Bahria University, Spring 2019 76


Encoding the Text
• This scheme also allows us to easily create different alignments of
the same text, e.g. “t-o” and “too” and “-to” all represent the same
text (“to”), but with different alignments to the image.
• The NN is trained to output an encoded text (encoded in the NN
output matrix).

DSC 704 Deep Learning Bahria University, Spring 2019 77


CTC Decoding – Best Path Decoding
• The output of the NN is a matrix containing character-probabilities for
each time-step (horizontal position)
• This matrix must be decoded to get the final text.
• A simple and very fast algorithm is best path decoding (greedy
search) which consists of two steps:
• It calculates the best path by taking the most likely character per time-step.
• It undoes the encoding by first removing duplicate characters and then
removing all blanks from the path. What remains represents the recognized
text.
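A short sketch of best path decoding on a made-up output matrix; the probability values are invented so that the best path is "aaa-b", matching the example on the next slide.

```python
import numpy as np

def best_path_decode(mat, alphabet, blank='-'):
    best = [alphabet[int(i)] for i in np.argmax(mat, axis=1)]          # best char per step
    collapsed = [c for i, c in enumerate(best) if i == 0 or c != best[i - 1]]
    return ''.join(c for c in collapsed if c != blank)                 # drop blanks last

# Illustrative matrix: 5 time steps, characters "a", "b", "-"; best path is "aaa-b"
mat = np.array([[0.8, 0.1, 0.1],
                [0.7, 0.1, 0.2],
                [0.6, 0.2, 0.2],
                [0.1, 0.3, 0.6],
                [0.2, 0.7, 0.1]])
print(best_path_decode(mat, ['a', 'b', '-']))   # -> "ab"
```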

DSC 704 Deep Learning Bahria University, Spring 2019 78


CTC Decoding
• Example: the characters are "a", "b" and "-"
(blank). There are 5 time-steps.
• The most likely character of t0 is “a”, the
same applies for t1 and t2.
• The blank character has the highest score at
t3.
• Finally, “b” is most likely at t4.
• This gives us the path “aaa-b”.
• We remove duplicate characters, this yields
“a-b”, and then we remove any blank from
the remaining path, which gives us the text
“ab” which we output as the recognized text.

DSC 704 Deep Learning Bahria University, Spring 2019 79


CTC Decoding
• The output of the model may have 32
time-steps, but the output might not
have 32 characters.
• The CTC cost function allows the RNN to generate a longer output (one
symbol per time-step, with blanks and repeated characters) that collapses
to the shorter recognized text.

DSC 704 Deep Learning Bahria University, Spring 2019 80


How to train the System?

CTC Loss

DSC 704 Deep Learning Bahria University, Spring 2019 81


CTC Loss

DSC 704 Deep Learning Bahria University, Spring 2019 82


Training: CTC Loss
• We only have to tell the CTC loss function the text that occurs in the
image.
• We ignore both the position and width of the characters in the image.
• No further processing of the recognized text is needed.

DSC 704 Deep Learning Bahria University, Spring 2019 83


Training: CTC Loss
• There is no need to annotate the images at each horizontal position
(time-step).
• The NN-training is guided by the CTC loss function.
• We only feed the output matrix of the NN and the corresponding
ground-truth (GT) text to the CTC loss function.
• The network does not know where each character occurs
• Instead, it tries all possible alignments of the GT text in the image and
takes the sum of all scores.
• This way, the score of a GT text is high if the sum over the alignment-
scores has a high value

DSC 704 Deep Learning Bahria University, Spring 2019 84


CTC Loss Calculation

DSC 704 Deep Learning Bahria University, Spring 2019 85


Loss Calculation
• We need to calculate the loss value for given pairs of
images and GT texts to train the NN.
• The NN outputs a matrix containing a score for each
character at each time-step.
• In the figure, there are two time-steps (t0, t1) and three
characters (“a”, “b” and the blank “-”). The character-
scores sum to 1 for each time-step.
• The loss is calculated by summing up all scores of all
possible alignments of the GT text, this way it does not
matter where the text appears in the image.

DSC 704 Deep Learning Bahria University, Spring 2019 86


Loss Calculation
• The score for one alignment (or path) is calculated by multiplying the
corresponding character scores together.
• Example: Assume the GT text is “a”
• Calculate all possible paths of length 2 (because the matrix has 2 time-steps)
which produce "a"
• “aa”, “a-” and “-a”
• Scores:
• “aa”: 0.4 x 0.4=0.16
• “a-”: 0.4 x 0.6=0.24
• “-a”: 0.6 x 0.4=0.24
• To get the score for a given GT text, we sum over the scores of all paths
corresponding to this text.
• Score for GT text “a”: 0.16 + 0.24 + 0.24 = 0.64
• If the GT text is the empty string "", there is only one corresponding
path, namely "--", which yields an overall score of 0.6 x 0.6 = 0.36.
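A brute-force sketch of this calculation for the two-time-step example above. Real CTC implementations use a forward-backward dynamic program instead of enumerating paths; this is only to make the sum concrete.

```python
import itertools, math

alphabet = ['a', 'b', '-']
mat = [[0.4, 0.0, 0.6],       # t0: scores for "a", "b", blank
       [0.4, 0.0, 0.6]]       # t1

def collapse(path, blank='-'):
    out = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return ''.join(c for c in out if c != blank)

def ctc_probability(mat, gt):
    total = 0.0
    for idx in itertools.product(range(len(alphabet)), repeat=len(mat)):
        path = [alphabet[i] for i in idx]
        if collapse(path) == gt:
            score = 1.0
            for t, i in enumerate(idx):
                score *= mat[t][i]                # multiply character scores along the path
            total += score                        # sum over all matching alignments
    return total

p = ctc_probability(mat, 'a')                     # 0.16 + 0.24 + 0.24 = 0.64
print('probability:', p, 'CTC loss:', -math.log(p))
```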

DSC 704 Deep Learning Bahria University, Spring 2019 87


Loss Calculation
• We calculated the probability of a GT text, but not the loss.
• The loss simply is the negative logarithm of the probability.
• The loss value is back-propagated through the NN and the parameters
of the NN are updated according to the optimizer used

DSC 704 Deep Learning Bahria University, Spring 2019 88


CTC Decoding – Beam Search
• Why can best path decoding fail?
• Example: the best path is "--", which decodes to the empty text ""
and has probability 0.8 x 0.6 = 0.48.
• The probability of the text "a" is the sum over all
probabilities of the paths "aa", "a-" and "-a":
0.2x0.4 + 0.2x0.6 + 0.8x0.4 = 0.52.
• "a" is therefore more probable than "" (0.52 > 0.48), yet best path decoding returns "".
• Beam Search Decoding is preferred over Best Path

DSC 704 Deep Learning Bahria University, Spring 2019 89
