
Long Short-Term Memory Networks

Applications

Imran Siddiqi
Bahria University, Islamabad
imran.siddiqi@bahria.edu.pk

DSC 704 Deep Learning Bahria University, Spring 2019 1


Recap

DSC 704 Deep Learning Bahria University, Spring 2019 2


Recap

DSC 704 Deep Learning Bahria University, Spring 2019 3


LSTMs for Sequence Classification

Many-to-one mapping

DSC 704 Deep Learning Bahria University, Spring 2019 4


Recall – Sentiment Classification with RNN

DSC 704 Deep Learning Bahria University, Spring 2019 5


Sentiment Classification with LSTM
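The model for this slide is not reproduced in the text, so here is a minimal Keras sketch of the idea: an Embedding layer feeds an LSTM whose final state drives a single sigmoid output (many-to-one). The IMDB dataset, layer sizes and epoch count are illustrative assumptions, not the lecture's exact setup.

```python
# Minimal sketch (illustrative, not the lecture's exact code): LSTM sentiment classifier.
# Assumes integer-encoded reviews such as the built-in IMDB dataset.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, max_len = 10000, 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)   # pad/truncate reviews to a fixed length
x_test = pad_sequences(x_test, maxlen=max_len)

model = Sequential([
    Embedding(vocab_size, 64, input_length=max_len),  # word index -> dense vector
    LSTM(64),                                         # many-to-one: keep only the final state
    Dense(1, activation='sigmoid'),                   # positive vs. negative
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```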

DSC 704 Deep Learning Bahria University, Spring 2019 6


LSTM for Sequence Classification on Images

Feed the columns/rows of the image one by one as a sequence (of length 28)

DSC 704 Deep Learning Bahria University, Spring 2019 7


LSTM for Sequence Classification on Images

Feed each row of the image as a sequence of 28 values


28 rows = 28 time steps
Each row has 28 values = feature vector size
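A minimal Keras sketch of this setup, assuming MNIST: each image is read as 28 time steps of 28-value rows and classified with a single LSTM layer. The layer width and epoch count are illustrative, not the slide's exact values.

```python
# Minimal sketch: each 28x28 MNIST image is a sequence of 28 rows (time steps),
# each row a 28-value feature vector, classified many-to-one by an LSTM.
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0          # shape (60000, 28, 28)
x_test = x_test.astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    LSTM(128, input_shape=(28, 28)),   # 28 time steps, 28 features per step
    Dense(10, activation='softmax'),   # one output per digit class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)

# For the 2-rows-per-step variant on the next slide, reshape to 14 steps of 56 features:
# x_train.reshape(-1, 14, 56) with LSTM(128, input_shape=(14, 56))
```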

DSC 704 Deep Learning Bahria University, Spring 2019 8


LSTM for Sequence Classification on Images
• Can feed multiple rows/columns per time step – for instance, a sequence of 2 rows at a time
• 2 rows = 2 x 28 = 56 values (features) per step
• This gives 14 time steps

DSC 704 Deep Learning Bahria University, Spring 2019 9


LSTM for Sequence Classification on Images
• Can build a deeper network by stacking LSTM layers

DSC 704 Deep Learning Bahria University, Spring 2019 10


C-RNN
• Why not use machine-learned features rather than raw pixel values as
input to the LSTM?
• Feed the image to a ConvNet
• Convert the output feature maps to sequences
• Feed these sequences to an LSTM

DSC 704 Deep Learning Bahria University, Spring 2019 11


C-RNN – Classification on CIFAR Dataset
• Load and Prepare Data

• Create Model

DSC 704 Deep Learning Bahria University, Spring 2019 12


C-RNN – Classification on CIFAR Dataset
• Train and Evaluate the model
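The code screenshots for these two slides are not in the extracted text; the following is a hedged end-to-end sketch of a C-RNN on CIFAR-10 (load and prepare data, create the model, train and evaluate). Layer sizes, epochs and batch size are illustrative assumptions, not the lecture's exact values.

```python
# Sketch of a C-RNN for CIFAR-10: ConvNet feature maps are reshaped into a sequence
# and read by an LSTM. Hyperparameters are illustrative.
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Reshape, LSTM, Dense

# Load and prepare data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# Create model: Conv front-end, then feature-map rows as a sequence for the LSTM
model = Sequential([
    Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),                 # feature maps: 8 x 8 x 64
    Reshape((8, 8 * 64)),                 # 8 "time steps", each a 512-value feature vector
    LSTM(128),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train and evaluate the model
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
loss, acc = model.evaluate(x_test, y_test)
print('test accuracy:', acc)
```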

DSC 704 Deep Learning Bahria University, Spring 2019 13


Sequence to Sequence Mapping

Many-to-many mapping

DSC 704 Deep Learning Bahria University, Spring 2019 14


Sequence to Sequence Mapping
• Machine Translation
• Speech Recognition
• Text/Handwriting Recognition

DSC 704 Deep Learning Bahria University, Spring 2019 15


Machine Translation

DSC 704 Deep Learning Bahria University, Spring 2019 16


Machine Translation

DSC 704 Deep Learning Bahria University, Spring 2019 17


Machine Translation

• Encoder Network:
• RNN/LSTM
• Feed one word at a time
• RNN outputs a vector that
represents the sentence

DSC 704 Deep Learning Bahria University, Spring 2019 18


Machine Translation
• Decoder Network:
• Takes output of encoder as input
• Can be trained to output the translation in English – one word at a time
• Given sufficient pairs of French and English sentences, the model works fairly well

DSC 704 Deep Learning Bahria University, Spring 2019 19


Machine Translation

DSC 704 Deep Learning Bahria University, Spring 2019 20


Smart Reply / Chatbots
• Instead of translations, train with possible answers

DSC 704 Deep Learning Bahria University, Spring 2019 21


Image Captioning
• An architecture very similar to the one used for machine translation also works for image
captioning

DSC 704 Deep Learning Bahria University, Spring 2019 22


Image Captioning
• Using a CNN (for example AlexNet), we can learn an encoding (feature
representation) of a given image

DSC 704 Deep Learning Bahria University, Spring 2019 23


Image Captioning

DSC 704 Deep Learning Bahria University, Spring 2019 24


Image Captioning

Drop the softmax layer and feed the encoding to an RNN, which predicts the caption one word at a time
DSC 704 Deep Learning Bahria University, Spring 2019 25
Machine Translation
• There can be many possible
translations
• How to pick?
• Greedy Search:
• Generate one word at a time: pick the most likely first word, then the most likely next word, and so on

DSC 704 Deep Learning Bahria University, Spring 2019 26


Machine Translation – Greedy Search

In making predictions – pick the most likely word at each time-step

DSC 704 Deep Learning Bahria University, Spring 2019 27


Machine Translation – Greedy Search

DSC 704 Deep Learning Bahria University, Spring 2019 28


Machine Translation – Greedy Search
• In each step we only consider the most likely output, but this does not always guarantee the best solution

The probability of the word "going" after "is" is larger than that of the word "visiting"

DSC 704 Deep Learning Bahria University, Spring 2019 29


Machine Translation – Greedy Search

DSC 704 Deep Learning Bahria University, Spring 2019 30


Machine Translation – Beam Search
• Considere multiple options in every step , not just one option
• These number of options is controlled by a variable called Beam
Width

DSC 704 Deep Learning Bahria University, Spring 2019 31


DSC 704 Deep Learning Bahria University, Spring 2019 32
Machine Translation – Beam Search

Step 1:
Define the Beam Width (e.g. B = 3) and, instead of the single most probable output, pick the three most probable outputs for the first word

DSC 704 Deep Learning Bahria University, Spring 2019 33


Machine Translation – Beam Search
Step 2:
For each of the three candidate first words, consider what the next word should be, scoring word pairs with

P(y<1>, y<2> | x) = P(y<1> | x) . P(y<2> | x, y<1>)

DSC 704 Deep Learning Bahria University, Spring 2019 36


Machine Translation – Beam Search
If the first word is "in", there are 10,000 choices for the second word – the same holds for "jane" and "september".
There will be 30,000 possible values of P(y<1>, y<2> | x).
Beam width = 3; pick the three most likely choices:
• in-september
• jane-in
• jane-visits

If beam search makes these three choices, it is rejecting "september" as the first word

DSC 704 Deep Learning Bahria University, Spring 2019 37


Machine Translation – Beam Search
For each of the three retained word pairs, evaluate the probability of the next word,

P(y<3> | x, y<1>, y<2>)

and again keep the B most likely partial translations.

DSC 704 Deep Learning Bahria University, Spring 2019 38


Machine Translation – Beam Search
Beam Search can generate multiple
samples, and the best of those are often
better than samples created using greedy
search.

DSC 704 Deep Learning Bahria University, Spring 2019 39


Beam Search – Worked Example (Beam Width = 2)

Per-time-step word probabilities:

Time-step    The     cat     is      red
T=0          0.4     0.0     0.3     0.0
T=1          0.1     0.2     0.0     0.0
T=2          0.0     0.0     0.8     0.1
T=3          0.2     0.0     0.0     0.8

T=0: keep the two most likely first words: "The" (0.4) and "is" (0.3).

T=1: extend each kept hypothesis with every word and multiply the probabilities:
The-The = 0.04, The-cat = 0.08, is-cat = 0.06, is-The = 0.03
Keep the two best: The-cat (0.08) and is-cat (0.06).

T=2: extend again:
The-cat-is = 0.064, The-cat-red = 0.008, is-cat-is = 0.048, is-cat-red = 0.006
Keep the two best: The-cat-is (0.064) and is-cat-is (0.048).

Tree representation: starting from the two first-word candidates (The: 0.4, is: 0.3), each branch carries the product of the probabilities along its path (0.04 and 0.08 under "The"; 0.03 and 0.06 under "is"), and only the two best branches survive at each level.

DSC 704 Deep Learning Bahria University, Spring 2019 50
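A small pure-Python sketch of the search above. The per-step probabilities are fixed here for illustration; a real seq2seq decoder would recompute the next-word distribution conditioned on the input sentence and the partial hypothesis at every step.

```python
# Beam search over a fixed table of per-step word probabilities (illustrative only).
vocab = ['The', 'cat', 'is', 'red']
probs = [                     # rows: time steps, columns: vocab
    [0.4, 0.0, 0.3, 0.0],     # T=0
    [0.1, 0.2, 0.0, 0.0],     # T=1
    [0.0, 0.0, 0.8, 0.1],     # T=2
]

def beam_search(probs, vocab, beam_width=2):
    beams = [([], 1.0)]                                   # (partial sentence, score)
    for step in probs:
        candidates = [(seq + [w], score * p)
                      for seq, score in beams
                      for w, p in zip(vocab, step)]
        # keep only the beam_width highest-scoring partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search(probs, vocab):
    print('-'.join(seq), round(score, 3))   # The-cat-is 0.064, is-cat-is 0.048
```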


Encoder-Decoder LSTM in Keras
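The code for this slide is not reproduced in the text; below is a minimal sketch of the standard encoder-decoder LSTM pattern in the Keras functional API. The token counts and latent dimension are placeholders, and the model is trained with teacher forcing (the decoder input is the target sequence shifted by one step).

```python
# Minimal encoder-decoder (seq2seq) LSTM sketch; dimensions are placeholders.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

num_encoder_tokens, num_decoder_tokens, latent_dim = 71, 93, 256

# Encoder: read the source sequence, keep only its final states as the sentence encoding
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: generate the target sequence, initialised with the encoder states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()
```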

DSC 704 Deep Learning Bahria University, Spring 2019 51


Machine Translation - Implementation

DSC 704 Deep Learning Bahria University, Spring 2019 52


Machine Translation - Implementation
• Training Samples

DSC 704 Deep Learning Bahria University, Spring 2019 53


Machine Translation – Sample Output

DSC 704 Deep Learning Bahria University, Spring 2019 54


Generative Model: Text Generation

DSC 704 Deep Learning Bahria University, Spring 2019 55


Text Generation using RNNs/LSTMs
• What is the next word of the following sentence:
• "The man is walking down ________" ?
• Given a dictionary containing all potential words, the neural network takes
the sequence of words as seed input: 1: "the", 2: "man", 3: "is", …
• Its output is a matrix giving, for each word in the dictionary, the probability
of being the next word of the given sequence.
• Based on the training data, it might guess that the next word will be
"the"…
• Generating text?
• Simply iterate the process: once the next word is drawn from the dictionary, we
append it to the sequence and then predict a new word for this extended sequence
DSC 704 Deep Learning Bahria University, Spring 2019 56


Text Generation with RNNs/LSTMs
• Word Level

DSC 704 Deep Learning Bahria University, Spring 2019 57


Text Generation using RNNs/LSTMs
• Character Level – Learn Dependencies between characters
• Source Text: Alice in Wonderland
• Load the ASCII text for the book into memory and convert all of the
characters to lowercase to reduce the vocabulary that the network
must learn.

Example Credit: Text Generation With LSTM Recurrent Neural Networks in Python with Keras, by Jason Brownlee

DSC 704 Deep Learning Bahria University, Spring 2019 58


• Convert the characters to integers by first creating a set of all of the distinct characters in the book, then
creating a map of each character to a unique integer.

• Split the book text up into subsequences with a fixed length of 100 characters, an arbitrary
length.
• Each training pattern of the network is comprised of 100 time steps of one character (X)
followed by one character output (y).
• When creating these sequences, we slide this window along the whole book one character
at a time, allowing each character a chance to be learned from the 100 characters that
preceded it (except the first 100 characters)
• For example, if the sequence length is 5 (for simplicity), the first training pattern is
characters 1-5 predicting character 6, the second is characters 2-6 predicting character 7, and so on.
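A sketch of the windowing step from the referenced tutorial, assuming raw_text (the lowercased book text) and char_to_int (the character-to-integer map) were built in the previous steps.

```python
# Slide a 100-character window over the text: X = 100 characters, y = the next one.
seq_length = 100
dataX, dataY = [], []
for i in range(0, len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]          # 100 input characters
    seq_out = raw_text[i + seq_length]           # the character to predict
    dataX.append([char_to_int[ch] for ch in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print('Total patterns:', n_patterns)
```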

DSC 704 Deep Learning Bahria University, Spring 2019 59


• Prepare the data – Under 150,000 training patterns

• Normalize the input and convert the output to a one-hot encoding
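A sketch of this preparation step, assuming dataX, dataY, n_patterns, seq_length and n_vocab from the windowing step above.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# [samples, time steps, features], scaled to [0, 1]; targets become one-hot vectors
X = np.reshape(dataX, (n_patterns, seq_length, 1)).astype('float32') / float(n_vocab)
y = to_categorical(dataY)
```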

DSC 704 Deep Learning Bahria University, Spring 2019 60


• We define a single hidden LSTM layer with 256 memory units. The network uses
dropout with a probability of 0.2 (20%).
• The output layer is a Dense layer using the softmax activation function to output
a probability (between 0 and 1) for each of the 47 characters.

• Train the model

• There is no test dataset. We are modeling the entire training dataset to learn the probability of each
character in a sequence.
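A sketch of this model and its training loop, assuming X and y from the previous step; the checkpoint filename pattern is illustrative, not the lecture's exact code.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

model = Sequential([
    LSTM(256, input_shape=(X.shape[1], X.shape[2])),  # 100 time steps, 1 feature per step
    Dropout(0.2),                                     # 20% dropout
    Dense(y.shape[1], activation='softmax'),          # one probability per character
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# keep the weights with the lowest training loss (there is no test set)
checkpoint = ModelCheckpoint('weights-{epoch:02d}-{loss:.4f}.hdf5',
                             monitor='loss', save_best_only=True, mode='min')
model.fit(X, y, epochs=20, batch_size=128, callbacks=[checkpoint])
```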

DSC 704 Deep Learning Bahria University, Spring 2019 61


• Text Generation: The simplest way to use the Keras LSTM model to make predictions is to start with a
seed sequence as input, generate the next character, then update the seed sequence by adding the generated
character to the end and trimming off the first character.
• This process is repeated for as long as we want to predict new characters (e.g. a sequence of 1,000
characters in length).
• We can pick a random input pattern as our seed sequence, then print generated characters as we generate
them.
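A sketch of that generation loop, assuming model, dataX, n_vocab and int_to_char from the previous steps.

```python
import numpy as np

start = np.random.randint(0, len(dataX) - 1)
pattern = list(dataX[start])                       # random seed pattern (integer-encoded)
for _ in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))             # most likely next character
    print(int_to_char[index], end='')
    pattern.append(index)                          # append it to the seed ...
    pattern = pattern[1:]                          # ... and trim off the first character
```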

DSC 704 Deep Learning Bahria University, Spring 2019 62


Sample training text from the book

Seed Text

Generated Text

DSC 704 Deep Learning Bahria University, Spring 2019 63


Observations
• The generated text generally conforms to the line format observed in
the original text of less than 80 characters before a new line.
• The characters are separated into word-like groups; most groups are actual
English words (e.g. "the", "little" and "was"), but many are not (e.g. "lott",
"tiie" and "taede").
• Some of the word sequences make sense (e.g. "and the white rabbit"), but
many do not (e.g. "wese tilel").

DSC 704 Deep Learning Bahria University, Spring 2019 64


Deeper Network
• Much better – though naturally, as a whole, the text still does not make much sense

DSC 704 Deep Learning Bahria University, Spring 2019 65


Sequence Modeling with CTC

DSC 704 Deep Learning Bahria University, Spring 2019 66


Sequence Modeling with CTC
• Audio clips and transcriptions
• How do the characters in the transcript align with the audio?
• Solutions:
• Rules: “One character corresponds to ten inputs”.
• People’s rates of speech vary, so this type of rule can always be broken.
• Hand-align each character to its location in the audio
• Prohibitively time consuming

DSC 704 Deep Learning Bahria University, Spring 2019 67


Sequence Modeling with CTC
• Handwriting Recognition
• We could create a data-set with images of text-lines, and then specify for
each horizontal position of the image the corresponding character
• Then, we could train a NN to output a character-score for each horizontal
position.
• Not practical
• Post-recognition processing also fails:
• A single character can span multiple horizontal positions, e.g. we could get
"ttooo" because the "o" is a wide character
• We would then have to remove all duplicate "t"s and "o"s
• But what if the correct text had been "too"? Then removing all duplicate
"o"s gives the wrong result

DSC 704 Deep Learning Bahria University, Spring 2019 68


Connectionist Temporal Classification (CTC)
• Connectionist Temporal Classification (CTC) is a way to get around not
knowing the alignment between the input and the output.
• It’s especially well suited to applications like speech and handwriting
recognition.

DSC 704 Deep Learning Bahria University, Spring 2019 69


Handwriting Recognition with CTC

DSC 704 Deep Learning Bahria University, Spring 2019 70


Handwriting Recognition with CTC

DSC 704 Deep Learning Bahria University, Spring 2019 71


Handwriting Recognition with CTC
Feed the first vector
(t=0) to LSTM

Alphabet = {"a", "e", "l", "p", "z", "-"}

"-" is a special symbol (blank) that we always add to the alphabet
(explained later)

DSC 704 Deep Learning Bahria University, Spring 2019 72


Handwriting Recognition with CTC
Feed the subsequent
vectors

DSC 704 Deep Learning Bahria University, Spring 2019 73


Handwriting Recognition with CTC

CTC

DSC 704 Deep Learning Bahria University, Spring 2019 74


Best Path Decoding

CTC Decoding

ap-pl-ee
Remove repeated symbols

ap-pl-e
Remove blanks

apple

DSC 704 Deep Learning Bahria University, Spring 2019 75


Encoding the Text
• Introducing a pseudo-character (called blank, not to be confused with a
"real" blank, i.e. a white-space character).
• This special character will be denoted as “-” in the following example.
• We use a coding scheme to solve the duplicate-character problem -
When encoding a text:
• We can insert arbitrarily many blanks at any position, which will be removed
when decoding it.
• We must insert a blank between duplicate characters like in “apple”.
• We can repeat each character as often as we like.

DSC 704 Deep Learning Bahria University, Spring 2019 76


Encoding the Text
• This scheme also allows us to easily create different alignments of
the same text, e.g. “t-o” and “too” and “-to” all represent the same
text (“to”), but with different alignments to the image.
• The NN is trained to output an encoded text (encoded in the NN
output matrix).

DSC 704 Deep Learning Bahria University, Spring 2019 77


CTC Decoding – Best Path Decoding
• The output of the NN is a matrix containing character-probabilities for
each time-step (horizontal position)
• This matrix must be decoded to get the final text.
• A simple and very fast algorithm is best path decoding (greedy
search) which consists of two steps:
• It calculates the best path by taking the most likely character per time-step.
• It undoes the encoding by first removing duplicate characters and then
removing all blanks from the path. What remains represents the recognized
text.
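A short sketch of best path decoding on a made-up output matrix; the probability values are invented so that the best path is "aaa-b", matching the example on the next slide.

```python
import numpy as np

def best_path_decode(mat, alphabet, blank='-'):
    best = [alphabet[int(i)] for i in np.argmax(mat, axis=1)]          # best char per step
    collapsed = [c for i, c in enumerate(best) if i == 0 or c != best[i - 1]]
    return ''.join(c for c in collapsed if c != blank)                 # drop blanks last

# Illustrative matrix: 5 time steps, characters "a", "b", "-"; best path is "aaa-b"
mat = np.array([[0.8, 0.1, 0.1],
                [0.7, 0.1, 0.2],
                [0.6, 0.2, 0.2],
                [0.1, 0.3, 0.6],
                [0.2, 0.7, 0.1]])
print(best_path_decode(mat, ['a', 'b', '-']))   # -> "ab"
```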

DSC 704 Deep Learning Bahria University, Spring 2019 78


CTC Decoding
• Example: the characters are "a", "b" and "-"
(blank). There are 5 time-steps.
• The most likely character of t0 is “a”, the
same applies for t1 and t2.
• The blank character has the highest score at
t3.
• Finally, “b” is most likely at t4.
• This gives us the path “aaa-b”.
• We remove duplicate characters, this yields
“a-b”, and then we remove any blank from
the remaining path, which gives us the text
“ab” which we output as the recognized text.

DSC 704 Deep Learning Bahria University, Spring 2019 79


CTC Decoding
• The output of the model may have 32
time-steps, but the output might not
have 32 characters.
• The CTC cost function allows the RNN to generate a longer output (one
symbol per time-step, with blanks and repeated characters) that collapses
to the shorter recognized text.

DSC 704 Deep Learning Bahria University, Spring 2019 80


How to train the System?

CTC Loss

DSC 704 Deep Learning Bahria University, Spring 2019 81


CTC Loss

DSC 704 Deep Learning Bahria University, Spring 2019 82


Training: CTC Loss
• We only have to tell the CTC loss function the text that occurs in the
image.
• We ignore both the position and width of the characters in the image.
• No further processing of the recognized text is needed.

DSC 704 Deep Learning Bahria University, Spring 2019 83


Training: CTC Loss
• There is no need to annotate the images at each horizontal position
(time-step).
• The NN-training is guided by the CTC loss function.
• We only feed the output matrix of the NN and the corresponding
ground-truth (GT) text to the CTC loss function.
• The network does not know where each character occurs
• Instead, it tries all possible alignments of the GT text in the image and
takes the sum of all scores.
• This way, the score of a GT text is high if the sum over the alignment-
scores has a high value

DSC 704 Deep Learning Bahria University, Spring 2019 84


CTC Loss Calculation

DSC 704 Deep Learning Bahria University, Spring 2019 85


Loss Calculation
• We need to calculate the loss value for given pairs of
images and GT texts to train the NN.
• The NN outputs a matrix containing a score for each
character at each time-step.
• In the figure, there are two time-steps (t0, t1) and three
characters (“a”, “b” and the blank “-”). The character-
scores sum to 1 for each time-step.
• The loss is calculated by summing up all scores of all
possible alignments of the GT text, this way it does not
matter where the text appears in the image.

DSC 704 Deep Learning Bahria University, Spring 2019 86


Loss Calculation
• The score for one alignment (or path) is calculated by multiplying the
corresponding character scores together.
• Example: Assume the GT text is “a”
• Calculate all possible paths of length 2 (because the matrix has 2 time-steps)
which produce "a"
• “aa”, “a-” and “-a”
• Scores:
• “aa”: 0.4 x 0.4=0.16
• “a-”: 0.4 x 0.6=0.24
• “-a”: 0.6 x 0.4=0.24
• To get the score for a given GT text, we sum over the scores of all paths
corresponding to this text.
• Score for GT text “a”: 0.16 + 0.24 + 0.24 = 0.64
• If the GT text is the empty string "", there is only one corresponding
path, namely "--", which yields an overall score of 0.6 x 0.6 = 0.36.
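A brute-force sketch of this calculation for the two-time-step example above. Real CTC implementations use a forward-backward dynamic program instead of enumerating paths; this is only to make the sum concrete.

```python
import itertools, math

alphabet = ['a', 'b', '-']
mat = [[0.4, 0.0, 0.6],       # t0: scores for "a", "b", blank
       [0.4, 0.0, 0.6]]       # t1

def collapse(path, blank='-'):
    out = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return ''.join(c for c in out if c != blank)

def ctc_probability(mat, gt):
    total = 0.0
    for idx in itertools.product(range(len(alphabet)), repeat=len(mat)):
        path = [alphabet[i] for i in idx]
        if collapse(path) == gt:
            score = 1.0
            for t, i in enumerate(idx):
                score *= mat[t][i]                # multiply character scores along the path
            total += score                        # sum over all matching alignments
    return total

p = ctc_probability(mat, 'a')                     # 0.16 + 0.24 + 0.24 = 0.64
print('probability:', p, 'CTC loss:', -math.log(p))
```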

DSC 704 Deep Learning Bahria University, Spring 2019 87


Loss Calculation
• We calculated the probability of a GT text, but not the loss.
• The loss simply is the negative logarithm of the probability.
• The loss value is back-propagated through the NN and the parameters
of the NN are updated according to the optimizer used

DSC 704 Deep Learning Bahria University, Spring 2019 88


CTC Decoding – Beam Search
• Why can best path decoding fail?
• Example: the best path is "--", which decodes to the empty text ""
and has probability 0.8 x 0.6 = 0.48.
• The probability of the text "a" is the sum over all
probabilities of the paths "aa", "a-" and "-a":
0.2x0.4 + 0.2x0.6 + 0.8x0.4 = 0.52.
• "a" is therefore more probable than "" (0.52 > 0.48), yet best path decoding returns "".
• Beam Search Decoding is preferred over Best Path

DSC 704 Deep Learning Bahria University, Spring 2019 89
