1.2. Recurrent neural network
Recurrent Neural Networks (RNNs) are sequential networks that enable the persistence
of information. The inherent chain-like nature of RNNs highlights their close relationship with
sequences and lists. Consequently, RNNs are the ideal neural network architecture for handling
such data [?].
The network above maps an input sequence (x) to an output sequence (h).
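The recurrence can be sketched as follows. This is a minimal vanilla RNN step in NumPy, not the document's exact model; the dimensions (4-dimensional inputs, 3-dimensional hidden state) and random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dimensional inputs, 3-dimensional hidden state
W = rng.normal(size=(3, 4))  # input-to-hidden weights
U = rng.normal(size=(3, 3))  # hidden-to-hidden weights
b = np.zeros(3)              # bias

def rnn_step(x_t, h_prev):
    # The hidden state carries information from earlier time steps forward
    return np.tanh(W @ x_t + U @ h_prev + b)

# Unroll the same cell over an input sequence x of length 5
xs = rng.normal(size=(5, 4))
h = np.zeros(3)
hs = []
for x_t in xs:
    h = rnn_step(x_t, h)
    hs.append(h)
```

Each output h_t depends on every earlier input through the repeated application of the same weights, which is what gives the network its chain-like structure.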
A key problem with RNNs is capturing long-term dependencies. Consider trying to predict the last word
in the text “I grew up in France. . . I speak fluent French.” Recent information suggests that the
next word is probably the name of a language, but if we want to narrow down which language,
we need the context of France, from further back. It’s entirely possible for the gap between the
relevant information and the point where it is needed to become very large [?].
In mathematical terms, when training the network, applying the chain rule to compute the
derivative produces a product of derivatives from the previous layers. If this product
becomes sufficiently long, it can lead to the problem of the “vanishing gradient” or “exploding
gradient”. With a vanishing gradient the result shrinks towards zero, while with an
exploding gradient the result grows towards infinity.
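A scalar caricature of this effect: backpropagation through T steps multiplies roughly T derivative factors together, so a factor slightly below 1 collapses to zero and a factor slightly above 1 blows up. The constants 0.9, 1.1, and T = 50 here are illustrative choices, not values from the text.

```python
def gradient_factor_product(d, T):
    # Backprop through T steps multiplies ~T derivative factors of size d
    return d ** T

T = 50
vanishing = gradient_factor_product(0.9, T)  # shrinks towards zero
exploding = gradient_factor_product(1.1, T)  # grows towards infinity
```

For T = 50, the first product is already below 0.01 while the second exceeds 100, which is why deep unrolled chains are hard to train with plain RNNs.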
In figure 1.4, the + and × operations are performed element-wise on one-dimensional
vectors; xt itself can be multi-dimensional.
The LSTM architecture is based on the following key components [?]:
• Cell State (ct ): This represents the memory of the LSTM and can store information over
long sequences. It can be updated, cleared, or read from at each time step.
• Hidden State (ht ): The hidden state serves as an intermediary between the cell state and
the external world. It can selectively remember or forget information from the cell state
and produce the output.
• Input Gate (it ): The input gate controls the flow of information into the cell state. It can
learn to accept or reject incoming data.
• Forget Gate (ft ): The forget gate determines what information from the previous cell
state should be retained and what should be discarded. It allows the LSTM to “forget”
irrelevant information.
• Output Gate (ot ): The output gate controls the information that is used to produce the
output at each time step. It decides what part of the cell state should be revealed to the
external world.
The typical flow of data is as follows, where the W and U are parameter matrices to be learned during
model training:
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
ĉ_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t · c_{t-1} + i_t · ĉ_t
h_t = o_t · tanh(c_t)
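The six equations above can be sketched as a single NumPy step. This is a didactic sketch of the standard LSTM cell, not a trained model: the sizes (4-dimensional inputs, 3-dimensional hidden state) and random parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h = 4, 3  # hypothetical input and hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (W, U, b) triple per gate, matching the equations above
params = {g: (rng.normal(size=(n_h, n_in)),
              rng.normal(size=(n_h, n_h)),
              np.zeros(n_h)) for g in "fioc"}

def lstm_step(x_t, h_prev, c_prev):
    Wf, Uf, bf = params["f"]; Wi, Ui, bi = params["i"]
    Wo, Uo, bo = params["o"]; Wc, Uc, bc = params["c"]
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)    # forget gate
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)    # input gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)    # output gate
    c_hat = np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat              # update cell memory
    h_t = o_t * np.tanh(c_t)                      # expose part of the cell
    return h_t, c_t

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
```

Note how the cell state c_t is updated additively (gated by f_t and i_t) rather than through a repeated matrix product, which is what mitigates the vanishing-gradient problem discussed earlier.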
In our text classification problem, each x_t is a real-valued representation of a word, and the last K hidden
states (where K is the number of classes) are the scores for the data point belonging to each class.
Applying the softmax function to these scores gives the probability of the data point
belonging to each class. The cross-entropy loss function is then used to calculate the
loss incurred by the model’s prediction.
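The scoring step can be sketched as follows. The class scores here are made-up numbers and the true class index is an assumption; the softmax and cross-entropy definitions are standard.

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for K = 3 classes
p = softmax(scores)                 # probabilities summing to 1
loss = cross_entropy(p, true_class=0)
```

The loss is small when the model puts most probability on the correct class and grows without bound as that probability approaches zero.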
A bidirectional LSTM captures context from both directions by processing the input sequence
forward and backward. The same sequence is processed twice: once in the forward direction,
starting from the beginning of the sequence, and once in the backward direction, starting
from the end of the sequence. During the forward pass, the forward LSTM
layer takes the input sequence and processes it step by step, updating its hidden state and cell
state. The final hidden state of the forward LSTM captures the information from the past
context. Simultaneously, during the backward pass, the backward LSTM layer processes the
input sequence in reverse order, updating its hidden state and cell state. The final hidden state
of the backward LSTM captures the information from the future context. The outputs of both
the forward and backward LSTMs are then combined, usually by concatenating or summing
them element-wise, to create the final output representation of each time step.
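The bidirectional combination can be sketched as follows. To keep the example short, a plain tanh recurrence stands in for each full LSTM layer (an assumption, not the document's model); the point is the reversed second pass and the concatenation of the two hidden-state sequences.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_h = 4, 3  # hypothetical input and hidden sizes

def run_direction(xs, W, U, b):
    # Run a simple recurrent layer over xs, returning all hidden states
    h = np.zeros(n_h)
    hs = []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        hs.append(h)
    return np.stack(hs)

# Separate parameters for the forward and backward passes
Wf, Uf, bf = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
Wb, Ub, bb = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)

xs = rng.normal(size=(5, n_in))
h_fwd = run_direction(xs, Wf, Uf, bf)              # past context
h_bwd = run_direction(xs[::-1], Wb, Ub, bb)[::-1]  # future context, re-aligned
out = np.concatenate([h_fwd, h_bwd], axis=1)       # one 2*n_h vector per step
```

Each time step's output now contains information from both the words before it and the words after it, which is exactly what the France/French example above requires.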