
Chapter 1

Long-short term memory neural network

1.1. Activation function


Activation functions are mathematical functions applied to the output of a neuron in a neural
network. They introduce non-linearity to the network, allowing it to learn complex patterns and
make non-linear transformations to the input data. Long-short term memory (LSTM) neural networks primarily use two activation functions: the sigmoid and the hyperbolic tangent.
• The sigmoid function is defined by the formula σ(x) = 1/(1 + exp(−x)). It has several important properties valuable for various machine learning tasks: it is differentiable at every point in the real numbers, it is monotonically increasing, and it maps the real numbers to the interval (0, 1).

Figure 1.1: Sigmoid function graph

• The hyperbolic tangent function can be defined by the formula tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x)). Similar to the sigmoid function, it is differentiable at every point in the real numbers, it is monotonically increasing, and it maps the real numbers to the interval (−1, 1). A short numerical sketch of both functions is given after Figure 1.2.

Figure 1.2: Hyperbolic tangent function graph
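As a concrete illustration, the following is a minimal NumPy sketch of the two activation functions; the function names and sample inputs are illustrative, not taken from the text.

import numpy as np

def sigmoid(x):
    # Maps every real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Maps every real number into the interval (-1, 1); equivalent to np.tanh(x).
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))  # approx. [0.0067, 0.5, 0.9933]
print(tanh(x))     # approx. [-0.9999, 0.0, 0.9999]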

1.2. Recurrent neural network
Recurrent Neural Networks (RNNs) are a type of sequential network that enables the persistence
of information. The inherent chain-like nature of RNNs highlights their close relationship with
sequences and lists. Consequently, RNNs are the ideal neural network architecture for handling
such data [?].

Figure 1.3: Basic recurrent neural network model

The network above maps an input sequence (x) to an output sequence (h).
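To make the recurrence concrete, below is a minimal sketch of a single vanilla RNN step in NumPy; the weight names Wxh and Whh and all dimensions are illustrative assumptions.

import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, b):
    # One step of a vanilla RNN: the new hidden state depends on the
    # current input and the previous hidden state.
    return np.tanh(Wxh @ x_t + Whh @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
Wxh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
Whh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):  # a sequence of length 10
    h = rnn_step(x_t, h, Wxh, Whh, b)         # h carries information forward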
The problem with RNNs is long-term dependencies. Consider trying to predict the last word
in the text “I grew up in France... I speak fluent French.” Recent information suggests that the
next word is probably the name of a language, but if we want to narrow down which language,
we need the context of France, from further back. It is entirely possible for the gap between the
relevant information and the point where it is needed to become very large [?].
In mathematical terms, when training the network, applying the chain rule to calculate the
derivative results in a sequence of derivative products from the previous layers. If this sequence
becomes sufficiently long, it can lead to the issue of the “vanishing gradient” or the “exploding
gradient”. In the case of a vanishing gradient, the result shrinks towards zero, while in the
case of an exploding gradient, the result grows towards infinity.
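A toy numerical illustration of this effect multiplies many per-step derivative factors together; the scalar factors 0.9 and 1.1 below are chosen purely for demonstration.

# Backpropagating through T time steps multiplies roughly T per-step factors.
# Factors below 1 shrink the gradient; factors above 1 blow it up.
T = 100
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= w      # stand-in for one per-step derivative factor
    print(w, grad)     # 0.9 -> ~2.7e-5 (vanishing), 1.1 -> ~1.4e4 (exploding)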

1.3. Long Short Term Memory neural network


Long Short Term Memory (LSTM) networks are a special kind of RNN, capable of learning
long-term dependencies. LSTMs also have a chain-like structure, but the repeating module has a
different form.

Figure 1.4: Long-Short Term Memory cell

In Figure 1.4, the + and × operations are performed element-wise on one-dimensional
vectors; xt can be multi-dimensional.
The LSTM architecture is based on the following key components [?]:
• Cell State (ct ): This represents the memory of the LSTM and can store information over
long sequences. It can be updated, cleared, or read from at each time step.

• Hidden State (ht ): The hidden state serves as an intermediary between the cell state and
the external world. It can selectively remember or forget information from the cell state
and produce the output.
• Input Gate (it ): The input gate controls the flow of information into the cell state. It can
learn to accept or reject incoming data.
• Forget Gate (ft ): The forget gate determines what information from the previous cell
state should be retained and what should be discarded. It allows the LSTM to “forget”
irrelevant information.

• Output Gate (ot ): The output gate controls the information that is used to produce the
output at each time step. It decides what part of the cell state should be revealed to the
external world.
The typical flow of data is as follows, where W and U are parameter matrices to be learned during
model training:

ft = σ(Wf xt + Uf ht−1 + bf)
it = σ(Wi xt + Ui ht−1 + bi)
ot = σ(Wo xt + Uo ht−1 + bo)
ĉt = tanh(Wc xt + Uc ht−1 + bc)
ct = ft · ct−1 + it · ĉt
ht = ot · tanh(ct)
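The following is a minimal NumPy sketch of one LSTM time step implementing the equations above; the dictionary-based parameter layout and all dimensions are our own illustrative choices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters for the forget, input, output gates
    # and the candidate cell state, keyed by "f", "i", "o", "c".
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat                          # new cell state
    h_t = o_t * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W = {k: rng.normal(size=(hidden_dim, input_dim)) * 0.1 for k in "fioc"}
U = {k: rng.normal(size=(hidden_dim, hidden_dim)) * 0.1 for k in "fioc"}
b = {k: np.zeros(hidden_dim) for k in "fioc"}

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(7, input_dim)):   # a sequence of length 7
    h, c = lstm_step(x_t, h, c, W, U, b)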

In our text classification problem, each xt is a real number representing a word, and the last K hidden
states (where K is the number of classes) serve as the scores of a data point belonging to each class.
Applying the softmax function to these scores gives the probability of the data point
belonging to each class. Subsequently, the cross-entropy loss function is used to calculate the
loss incurred by the model’s prediction.
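A small sketch of the softmax and cross-entropy computation described above; the score values and class index are illustrative.

import numpy as np

def softmax(scores):
    # Subtracting the maximum keeps the exponentials numerically stable.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(probs[true_class])

scores = np.array([2.0, 0.5, -1.0])   # e.g. K = 3 class scores from the LSTM
probs = softmax(scores)               # class membership probabilities
loss = cross_entropy(probs, true_class=0)
print(probs, loss)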

1.4. Bidirectional Long-Short Term Memory neural network


Bidirectional LSTM is a variant of the traditional LSTM model that incorporates information
from both past and future contexts.

Figure 1.5: Bidirectional Long-Short Term Memory neural network

It achieves this by processing the input sequence in two directions: forward and backward.
The full sequence is passed through two separate LSTM layers: one processes it in the forward
direction, starting from the beginning of the sequence, while the other processes it in the backward
direction, starting from the end of the sequence. During the forward pass, the forward LSTM

layer takes the input sequence and processes it step by step, updating its hidden state and cell
state. The final hidden state of the forward LSTM captures the information from the past
context. Simultaneously, during the backward pass, the backward LSTM layer processes the
input sequence in reverse order, updating its hidden state and cell state. The final hidden state
of the backward LSTM captures the information from the future context. The outputs of both
the forward and backward LSTMs are then combined, usually by concatenating or summing
them element-wise, to create the final output representation of each time step.
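The text does not prescribe a particular implementation; as one possible sketch, PyTorch’s nn.LSTM supports bidirectional processing and concatenates the forward and backward hidden states at each time step (all dimensions below are illustrative).

import torch
import torch.nn as nn

seq_len, batch, input_dim, hidden_dim = 12, 1, 16, 32

bilstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                 bidirectional=True)
x = torch.randn(seq_len, batch, input_dim)

out, (h_n, c_n) = bilstm(x)
# out[t] concatenates the forward and backward hidden states at step t:
print(out.shape)   # torch.Size([12, 1, 64])
# h_n holds the final forward and final backward hidden states separately:
print(h_n.shape)   # torch.Size([2, 1, 32])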
