
Sequence Transduction with Recurrent Neural Networks

Alex Graves graves@cs.toronto.edu


Department of Computer Science, University of Toronto, Canada

Abstract

Many machine learning tasks can be expressed as the transformation—or transduction—of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation, since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. Experimental results for phoneme recognition are provided on the TIMIT speech corpus.

1. Introduction

The ability to transform and manipulate sequences is a crucial part of human intelligence: everything we know about the world reaches us in the form of sensory sequences, and everything we do to interact with the world requires sequences of actions and thoughts. The creation of automatic sequence transducers therefore seems an important step towards artificial intelligence. A major problem faced by such systems is how to represent sequential information in a way that is invariant, or at least robust, to sequential distortions such as shrinking, stretching and translating. Moreover this robustness should apply to both the input and output sequences.

For example, transforming audio signals into sequences of words requires the ability to identify speech sounds (such as phonemes or syllables) despite the apparent distortions created by different voices, variable speaking rates, background noise etc. If a language model is used to inject prior knowledge about the output sequences, it must also be robust to missing words, mispronunciations, non-lexical utterances etc.

Recurrent neural networks (RNNs) are a promising architecture for general-purpose sequence transduction. The combination of a high-dimensional multivariate internal state and nonlinear state-to-state dynamics offers more expressive power than conventional sequential algorithms such as hidden Markov models. In particular RNNs are better at storing and accessing information over long periods of time. While the early years of RNNs were dogged by difficulties in learning (Hochreiter et al., 2001), recent results have shown that they are now capable of delivering state-of-the-art results in real-world tasks such as handwriting recognition (Graves et al., 2008; Graves & Schmidhuber, 2008), text generation (Sutskever et al., 2011) and language modelling (Mikolov et al., 2010). Furthermore, these results demonstrate the use of long-range memory to perform such actions as closing parentheses after many intervening characters (Sutskever et al., 2011), or using delayed strokes to identify handwritten characters from pen trajectories (Graves et al., 2008).

However RNNs are usually restricted to problems where the alignment between the input and output sequence is known in advance. For example, RNNs may be used to classify every frame in a speech signal, or every amino acid in a protein chain. If the network outputs are probabilistic this leads to a distribution over output sequences of the same length as the input sequence. But for a general-purpose sequence transducer, where the output length is unknown in advance, we would prefer a distribution over sequences of all lengths. Furthermore, since we do not know how the inputs and outputs should be aligned, this distribution would ideally cover all possible alignments.
Connectionist Temporal Classification (CTC) is an RNN output layer that defines a distribution over all alignments with all output sequences not longer than the input sequence (Graves et al., 2006). However, as well as precluding tasks, such as text-to-speech, where the output sequence is longer than the input sequence, CTC does not model the interdependencies between the outputs. The transducer described in this paper extends CTC by defining a distribution over output sequences of all lengths, and by jointly modelling both input-output and output-output dependencies.

As a discriminative sequential model the transducer has similarities with 'chain-graph' conditional random fields (CRFs) (Lafferty et al., 2001). However the transducer's construction from RNNs, with their ability to extract features from raw data and their potentially unbounded range of dependency, is in marked contrast with the pairwise output potentials and hand-crafted input features typically used for CRFs. Closer in spirit is the Graph Transformer Network (Bottou et al., 1997) paradigm, in which differentiable modules (often neural networks) can be globally trained to perform consecutive graph transformations such as detection, segmentation and recognition.

Section 2 defines the RNN transducer, showing how it can be trained and applied to test data; Section 3 presents experimental results on the TIMIT speech corpus; and concluding remarks and directions for future work are given in Section 4.

2. Recurrent Neural Network Transducer

Let x = (x_1, x_2, ..., x_T) be a length T input sequence belonging to the set X* of all sequences over some input space X. Let y = (y_1, y_2, ..., y_U) be a length U output sequence belonging to the set Y* of all sequences over some output space Y. Both the input vectors x_t and the output vectors y_u are represented by fixed-length real-valued vectors; for example, if the task is phonetic speech recognition, each x_t would typically be a vector of MFC coefficients and each y_u would be a one-hot vector encoding a particular phoneme. In this paper we will assume that the output space is discrete; however the method can be readily extended to continuous output spaces, provided a tractable, differentiable model can be found for Y.

Define the extended output space Ȳ as Y ∪ ∅, where ∅ denotes the null output. The intuitive meaning of ∅ is 'output nothing'; the sequence (y_1, ∅, ∅, y_2, ∅, y_3) ∈ Ȳ* is therefore equivalent to (y_1, y_2, y_3) ∈ Y*. We refer to the elements a ∈ Ȳ* as alignments, since the location of the null symbols determines an alignment between the input and output sequences. Given x, the RNN transducer defines a conditional distribution Pr(a ∈ Ȳ*|x). This distribution is then collapsed onto the following distribution over Y*:

Pr(y ∈ Y*|x) = Σ_{a ∈ B^{-1}(y)} Pr(a|x)    (1)

where B : Ȳ* → Y* is a function that removes the null symbols from the alignments in Ȳ*.
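As an illustration, a minimal Python sketch of the collapsing function B, with the Python value None standing in for the null symbol (an assumption made only for this sketch):

def collapse(alignment, null=None):
    # B of Eq. (1): strip every null symbol from an alignment in the extended space.
    return [label for label in alignment if label != null]

# collapse(['y1', None, None, 'y2', None, 'y3']) -> ['y1', 'y2', 'y3'], so Eq. (1)
# sums Pr(a|x) over every alignment a that B maps onto the same output sequence y.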
Two recurrent neural networks are used to determine Pr(a ∈ Ȳ*|x). One network, referred to as the transcription network F, scans the input sequence x and outputs the sequence f = (f_1, ..., f_T) of transcription vectors¹. The other network, referred to as the prediction network G, scans the output sequence y and outputs the prediction vector sequence g = (g_0, g_1, ..., g_U).

¹ For simplicity we assume the transcription sequence to be the same length as the input sequence; however this may not be true, for example if the transcription network uses a pooling architecture (LeCun et al., 1998) to reduce the sequence length.

2.1. Prediction Network

The prediction network G is a recurrent neural network consisting of an input layer, an output layer and a single hidden layer. The length U + 1 input sequence ŷ = (∅, y_1, ..., y_U) to G is the output sequence y with ∅ prepended. The inputs are encoded as one-hot vectors; that is, if Y consists of K labels and y_u = k, then ŷ_u is a length K vector whose elements are all zero except the k-th, which is one. ∅ is encoded as a length K vector of zeros. The input layer is therefore size K. The output layer is size K + 1 (one unit for each element of Ȳ) and hence the prediction vectors g_u are also size K + 1.

Given ŷ, G computes the hidden vector sequence (h_0, ..., h_U) and the prediction sequence (g_0, ..., g_U) by iterating the following equations from u = 0 to U:

h_u = H(W_ih ŷ_u + W_hh h_{u-1} + b_h)    (2)

g_u = W_ho h_u + b_o    (3)

where W_ih is the input-hidden weight matrix, W_hh is the hidden-hidden weight matrix, W_ho is the hidden-output weight matrix, b_h and b_o are bias terms, and H is the hidden layer function. In traditional RNNs H is an elementwise application of the tanh or logistic sigmoid σ(x) = 1/(1 + exp(−x)) functions.
However we have found that the Long Short-Term Memory (LSTM) architecture (Hochreiter & Schmidhuber, 1997; Gers, 2001) is better at finding and exploiting long range contextual information. For the version of LSTM used in this paper, H is implemented by the following composite function:

α_n = σ(W_iα i_n + W_hα h_{n-1} + W_sα s_{n-1})    (4)

β_n = σ(W_iβ i_n + W_hβ h_{n-1} + W_sβ s_{n-1})    (5)

s_n = β_n s_{n-1} + α_n tanh(W_is i_n + W_hs h_{n-1})    (6)

γ_n = σ(W_iγ i_n + W_hγ h_{n-1} + W_sγ s_n)    (7)

h_n = γ_n tanh(s_n)    (8)

where α, β, γ and s are respectively the input gate, forget gate, output gate and state vectors, all of which are the same size as the hidden vector h. The weight matrix subscripts have the obvious meaning, for example W_hα is the hidden-input gate matrix, W_iγ is the input-output gate matrix etc. The weight matrices from the state to the gate vectors are diagonal, so element m in each gate vector only receives input from element m of the state vector. The bias terms (which are added to α, β, s and γ) have been omitted for clarity.
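A minimal numpy sketch of one step of this composite function; the dictionary of weight matrices and the storage of the diagonal state-to-gate weights as vectors are assumptions made for illustration, and biases are omitted as in the text:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(i_n, h_prev, s_prev, W):
    # W['ia'] etc. name the matrices by their subscripts (i = input, h = hidden,
    # s = state; a / b / g = input / forget / output gate). W['sa'], W['sb'] and
    # W['sg'] are the diagonal state-to-gate weights, stored here as vectors.
    a = sigmoid(W['ia'] @ i_n + W['ha'] @ h_prev + W['sa'] * s_prev)   # Eq. (4)
    b = sigmoid(W['ib'] @ i_n + W['hb'] @ h_prev + W['sb'] * s_prev)   # Eq. (5)
    s = b * s_prev + a * np.tanh(W['is'] @ i_n + W['hs'] @ h_prev)     # Eq. (6)
    g = sigmoid(W['ig'] @ i_n + W['hg'] @ h_prev + W['sg'] * s)        # Eq. (7)
    h = g * np.tanh(s)                                                 # Eq. (8)
    return h, s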
The prediction network attempts to model each element of y given the previous ones; it is therefore similar to a standard next-step-prediction RNN, only with the added option of making 'null' predictions.
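To make the recursion of Eqs. (2)-(3) concrete, here is a minimal sketch of the prediction network's forward pass; using tanh for H and a zero initial hidden state are simplifying assumptions (the paper uses the LSTM function above):

import numpy as np

def prediction_network(y, K, W_ih, W_hh, W_ho, b_h, b_o):
    # y: target label sequence as 0-based indices into the K labels.
    U = len(y)
    y_hat = np.zeros((U + 1, K))          # y_hat[0] is the all-zero encoding of null
    for u, label in enumerate(y):
        y_hat[u + 1, label] = 1.0         # one-hot encoding of y_u
    h = np.zeros(W_hh.shape[0])           # assumed zero initial hidden state
    g = np.zeros((U + 1, K + 1))
    for u in range(U + 1):
        h = np.tanh(W_ih @ y_hat[u] + W_hh @ h + b_h)   # Eq. (2)
        g[u] = W_ho @ h + b_o                           # Eq. (3)
    return g                              # prediction vectors g_0 .. g_U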
2.2. Transcription Network

The transcription network F is a bidirectional RNN (Schuster & Paliwal, 1997) that scans the input sequence x forwards and backwards with two separate hidden layers, both of which feed forward to a single output layer. Bidirectional RNNs are preferred because each output vector depends on the whole input sequence (rather than on the previous inputs only, as is the case with normal RNNs); however we have not tested to what extent this impacts performance.

Given a length T input sequence (x_1, ..., x_T), a bidirectional RNN computes the forward hidden sequence (→h_1, ..., →h_T), the backward hidden sequence (←h_1, ..., ←h_T), and the transcription sequence (f_1, ..., f_T) by first iterating the backward layer from t = T to 1:

←h_t = H(W_{i←h} i_t + W_{←h←h} ←h_{t+1} + b_{←h})    (9)

then iterating the forward and output layers from t = 1 to T:

→h_t = H(W_{i→h} i_t + W_{→h→h} →h_{t-1} + b_{→h})    (10)

o_t = W_{→h o} →h_t + W_{←h o} ←h_t + b_o    (11)

For a bidirectional LSTM network (Graves & Schmidhuber, 2005), H is implemented by Eqs. (4) to (8). For a task with K output labels, the output layer of the transcription network is size K + 1, just like the prediction network, and hence the transcription vectors f_t are also size K + 1.

The transcription network is similar to a Connectionist Temporal Classification RNN, which also uses a null output to define a distribution over input-output alignments.
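A minimal sketch of this bidirectional pass, again with tanh standing in for H and plain ASCII names standing in for the arrow-decorated weight matrices (both assumptions for illustration only):

import numpy as np

def transcription_network(x, W_if, W_ff, b_f, W_ib, W_bb, b_b, W_fo, W_bo, b_o):
    # x: (T, input_dim) input sequence; returns the (T, K+1) transcription vectors.
    T = x.shape[0]
    n = W_ff.shape[0]
    h_bwd = np.zeros((T, n))
    prev = np.zeros(n)
    for t in range(T - 1, -1, -1):                        # Eq. (9): backward layer, t = T..1
        prev = np.tanh(W_ib @ x[t] + W_bb @ prev + b_b)
        h_bwd[t] = prev
    h_fwd = np.zeros((T, n))
    prev = np.zeros(n)
    for t in range(T):                                    # Eq. (10): forward layer, t = 1..T
        prev = np.tanh(W_if @ x[t] + W_ff @ prev + b_f)
        h_fwd[t] = prev
    return h_fwd @ W_fo.T + h_bwd @ W_bo.T + b_o          # Eq. (11): output layer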
2.3. Output Distribution

Given the transcription vector f_t, where 1 ≤ t ≤ T, the prediction vector g_u, where 0 ≤ u ≤ U, and label k ∈ Ȳ, define the output density function

h(k, t, u) = exp(f_t^k + g_u^k)    (12)

where superscript k denotes the k-th element of the vectors. The density can be normalised to yield the conditional output distribution:

Pr(k ∈ Ȳ | t, u) = h(k, t, u) / Σ_{k'∈Ȳ} h(k', t, u)    (13)

To simplify notation, define

y(t, u) ≡ Pr(y_{u+1} | t, u)    (14)

∅(t, u) ≡ Pr(∅ | t, u)    (15)
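In code, Eqs. (12)-(15) amount to a softmax over the element-wise sum of one transcription vector and one prediction vector; placing the null label at index 0 is an assumed convention of this sketch:

import numpy as np

def output_distribution(f_t, g_u):
    h = np.exp(f_t + g_u)          # Eq. (12), evaluated for every k at once
    return h / h.sum()             # Eq. (13): Pr(k | t, u) over the K+1 labels

# With null at index 0 and y the target labels, the shorthands of Eqs. (14)-(15)
# are just two entries of this vector:
#   probs      = output_distribution(f[t], g[u])
#   null_prob  = probs[0]          # the null probability at (t, u)
#   label_prob = probs[y[u]]       # y(t, u), defined for u < U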
Pr(k|t, u) is used to determine the transition probabilities in the lattice shown in Fig. 1. The set of possible paths from the bottom left to the terminal node in the top right corresponds to the complete set of alignments between x and y, i.e. to the set Ȳ* ∩ B^{-1}(y). Therefore all possible input-output alignments are assigned a probability, the sum of which is the total probability Pr(y|x) of the output sequence given the input sequence. Since a similar lattice could be drawn for any finite y ∈ Y*, Pr(k|t, u) defines a distribution over all possible output sequences, given a single input sequence.

Figure 1. Output probability lattice defined by Pr(k|t, u). The node at (t, u) represents the probability of having output the first u elements of the output sequence by point t in the transcription sequence. The horizontal arrow leaving node (t, u) represents the probability ∅(t, u) of outputting nothing at (t, u); the vertical arrow represents the probability y(t, u) of outputting element u + 1 of y. The black nodes at the bottom represent the null state before any outputs have been emitted. The paths starting at the bottom left and reaching the terminal node in the top right (one of which is shown in red) correspond to the possible alignments between the input and output sequences. Each alignment starts with probability 1, and its final probability is the product of the transition probabilities of the arrows it passes through (shown for the red path).

A naive calculation of Pr(y|x) from the lattice would be intractable; however an efficient forward-backward algorithm is described below.

2.4. Forward-Backward Algorithm

Define the forward variable α(t, u) as the probability of outputting y_{[1:u]} during f_{[1:t]}. The forward variables for all 1 ≤ t ≤ T and 0 ≤ u ≤ U can be calculated recursively using

α(t, u) = α(t-1, u) ∅(t-1, u) + α(t, u-1) y(t, u-1)    (16)

with initial condition α(1, 0) = 1. The total output sequence probability is equal to the forward variable at the terminal node:

Pr(y|x) = α(T, U) ∅(T, U)    (17)

Define the backward variable β(t, u) as the probability of outputting y_{[u+1:U]} during f_{[t:T]}. Then

β(t, u) = β(t+1, u) ∅(t, u) + β(t, u+1) y(t, u)    (18)

with initial condition β(T, U) = ∅(T, U). From the definition of the forward and backward variables it follows that their product α(t, u)β(t, u) at any point (t, u) in the output lattice is equal to the probability of emitting the complete output sequence if y_u is emitted during transcription step t. Fig. 2 shows a plot of the forward variables, the backward variables and their product for a speech recognition task.

Figure 2. Forward-backward variables during a speech recognition task. The image at the bottom is the input sequence: a spectrogram of an utterance. The three heat maps above it show the logarithms of the forward variables (top), backward variables (middle) and their product (bottom) across the output lattice. The text to the left is the target sequence.
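A minimal sketch of the two recursions, working directly in probability space for clarity (a practical implementation would use log probabilities to avoid underflow); the 0-based array indexing is an assumption of the sketch:

import numpy as np

def forward_backward(null_prob, label_prob):
    # null_prob[t, u]  = Pr(null    | t+1, u)  for t = 0..T-1, u = 0..U
    # label_prob[t, u] = Pr(y_{u+1} | t+1, u)  (column u = U is never used)
    T, U1 = null_prob.shape
    alpha = np.zeros((T, U1))
    alpha[0, 0] = 1.0                                        # alpha(1, 0) = 1
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            from_left  = alpha[t-1, u] * null_prob[t-1, u] if t > 0 else 0.0
            from_below = alpha[t, u-1] * label_prob[t, u-1] if u > 0 else 0.0
            alpha[t, u] = from_left + from_below             # Eq. (16)

    beta = np.zeros((T, U1))
    beta[T-1, U1-1] = null_prob[T-1, U1-1]                   # beta(T, U) = null(T, U)
    for t in range(T - 1, -1, -1):
        for u in range(U1 - 1, -1, -1):
            if t == T - 1 and u == U1 - 1:
                continue
            go_right = beta[t+1, u] * null_prob[t, u] if t + 1 < T else 0.0
            go_up    = beta[t, u+1] * label_prob[t, u] if u + 1 < U1 else 0.0
            beta[t, u] = go_right + go_up                    # Eq. (18)

    total = alpha[T-1, U1-1] * null_prob[T-1, U1-1]          # Eq. (17): Pr(y|x)
    return alpha, beta, total

As a sanity check, beta[0, 0] should equal the returned total probability.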
compute all the exp (f (t, x)) and exp(g(y[1:u] )) terms
with initial condition β(T, U ) = ∅(T, U ). From the and use their products to determine Pr(k|t, u). This
definition of the forward and backward variables it reduces the number of exponential evaluations from
follows that their product α(t, u)β(t, u) at any point O(T U ) to O(T + U ) for each length T transcription
(t, u) in the output lattice is equal to the probabil- sequence and length U target sequence used for train-
ity of emitting the complete output sequence if yu is ing.
emitted during transcription step t. Fig. 2 shows a plot
of the forward variables, the backward variables and 2.6. Testing
their product for a speech recognition task.
When the transducer is evaluated on test data, we
seek the mode of the output sequence distribution in-
2.5. Training
duced by the input sequence. Unfortunately, finding
Given an input sequence x and a target sequence y ∗ , the mode is much harder than determining the prob-
the natural way to train the model is to minimise the ability of a single sequence. The complication is that
2.6. Testing

When the transducer is evaluated on test data, we seek the mode of the output sequence distribution induced by the input sequence. Unfortunately, finding the mode is much harder than determining the probability of a single sequence. The complication is that the prediction function g(y_{[1:u]}) (and hence the output distribution Pr(k|t, u)) may depend on all previous outputs emitted by the model. The method employed in this paper is a fixed-width beam search through the tree of output sequences. The advantage of beam search is that it scales to arbitrarily long sequences, and allows computational cost to be traded off against search accuracy.

Let Pr(y) be the approximate probability of emitting some output sequence y found by the search so far. Let Pr(k|y, t) be the probability of extending y by k ∈ Ȳ during transcription step t. Let pref(y) be the set of proper prefixes of y (including the null sequence ∅), and for some ŷ ∈ pref(y), let Pr(y|ŷ, t) = Π_{u=|ŷ|+1}^{|y|} Pr(y_u | y_{[0:u-1]}, t). Pseudocode for a width W beam search for the output sequence with highest length-normalised probability, given some length T transcription sequence, is given in Algorithm 1.

Algorithm 1 Output Sequence Beam Search

  Initialise: B = {∅}; Pr(∅) = 1
  for t = 1 to T do
    A = B
    B = {}
    for y in A do
      Pr(y) += Σ_{ŷ ∈ pref(y) ∩ A} Pr(ŷ) Pr(y|ŷ, t)
    end for
    while B contains less than W elements more probable than the most probable in A do
      y* = most probable in A
      Remove y* from A
      Pr(y*) = Pr(y*) Pr(∅|y*, t)
      Add y* to B
      for k ∈ Y do
        Pr(y* + k) = Pr(y*) Pr(k|y*, t)
        Add y* + k to A
      end for
    end while
    Remove all but the W most probable from B
  end for
  Return: y with highest log Pr(y)/|y| in B

The algorithm can be trivially extended to an N best search (N ≤ W) by returning a sorted list of the N best elements in B instead of the single best element. The length normalisation in the final line appears to be important for good performance, as otherwise shorter output sequences are excessively favoured over longer ones; similar techniques are employed for hidden Markov models in speech and handwriting recognition (Bertolami et al., 2006).

Observing from Eq. (2) that the prediction network outputs are independent of previous hidden vectors given the current one, we can iteratively compute the prediction vectors for each output sequence y + k considered during the beam search by storing the hidden vectors for all y, and running Eq. (2) for one step with k as input. The prediction vectors can then be combined with the transcription vectors to compute the probabilities. This procedure greatly accelerates the beam search, at the cost of increased memory use. Note that for LSTM networks both the hidden vectors h and the state vectors s should be stored.
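For illustration only, here is a greedy (width-1) decoder: it repeatedly emits the single most probable label at each transcription step and moves on when the null label wins. It omits the prefix summation, the beam and the length normalisation of Algorithm 1, and the step_probs helper (wrapping the prediction network and Eq. (13)) is an assumed interface:

def greedy_decode(f, step_probs, null=0, max_symbols_per_step=10):
    # f: sequence of transcription vectors f_1 .. f_T.
    # step_probs(f_t, prefix): returns the K+1 probabilities Pr(k | t, u) for the
    # current prefix of emitted labels.
    prefix = []
    for f_t in f:
        for _ in range(max_symbols_per_step):   # cap the labels emitted per step
            probs = step_probs(f_t, prefix)
            k = max(range(len(probs)), key=lambda i: probs[i])
            if k == null:
                break                           # advance to the next transcription step
            prefix.append(k)
    return prefix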
3. Experimental Results

To evaluate the potential of the RNN transducer we applied it to the task of phoneme recognition on the TIMIT speech corpus (DAR, 1990). We also compared its performance to that of a standalone next-step prediction RNN and a standalone Connectionist Temporal Classification (CTC) RNN, to gain insight into the interaction between the two sources of information.

3.1. Task and Data

The core training and test sets of TIMIT (which we used for our experiments) contain respectively 3696 and 192 phonetically transcribed utterances. We defined a validation set by randomly selecting 184 sequences from the training set; this put us at a slight disadvantage compared to many TIMIT evaluations, where the validation set is drawn from the non-core test set and all 3696 sequences are used for training. The reduced set of 39 phoneme targets (Lee & Hon, 1989) was used during both training and testing.

Standard speech preprocessing was applied to transform the audio files into feature sequences. A 26 channel mel-frequency filter bank and a pre-emphasis coefficient of 0.97 were used to compute 12 mel-frequency cepstral coefficients plus an energy coefficient on 25ms Hamming windows at 10ms intervals. Delta coefficients were added to give 26-dimensional input vectors, and all coefficients were normalised to have mean zero and standard deviation one over the training set.

The standard performance measure for TIMIT is the phoneme error rate on the test set: that is, the summed edit distance between the output sequences and the target sequences, divided by the total length of the target sequences. Phoneme error rate, which is customarily presented as a percentage, is recorded for both the transcription network and the transducer. The error recorded for the prediction network is the misclassification rate of the next phoneme given the previous ones.

We also record the log-loss on the test set. To put this quantity in more accessible terms we convert it into the average number of bits per phoneme target.

3.2. Network Parameters

The prediction network consisted of a size 128 LSTM hidden layer, 39 input units and 40 output units. The transcription network consisted of two size 128 LSTM hidden layers, 26 inputs and 40 outputs. This gave a total of 261,328 weights in the RNN transducer. The standalone prediction and CTC networks (which were structurally identical to their counterparts in the transducer, except that the prediction network had one fewer output unit) had 91,431 and 169,768 weights respectively. All networks were trained with online steepest descent (weight updates after every sequence) using a learning rate of 10^-4 and a momentum of 0.9. Gaussian weight noise (Jim et al., 1996) with a standard deviation of 0.075 was injected during training to reduce overfitting. The prediction and transduction networks were stopped at the point of lowest log-loss on the validation set; the CTC network was stopped at the point of lowest phoneme error rate on the validation set. All networks were initialised with uniformly distributed random weights in the range [-0.1, 0.1]. For the CTC network, prefix search decoding (Graves et al., 2006) was used to transcribe the test set, with a probability threshold of 0.995. For the transduction network, the beam search algorithm described in Algorithm 1 was used with a beam width of 4000.

3.3. Results

The results are presented in Table 1. The phoneme error rate of the transducer is among the lowest recorded on TIMIT (the current benchmark is 20.5% (Dahl et al., 2010)). As far as we are aware, it is the best result with a recurrent neural network.

Table 1. Phoneme Recognition Results on the TIMIT Speech Corpus. 'Log-loss' is in units of bits per target phoneme. 'Epochs' is the number of passes through the training set before convergence.

  Network      Epochs    Log-loss    Error Rate
  Prediction   58        4.0         72.9%
  CTC          96        1.3         25.5%
  Transducer   76        1.0         23.2%

Nonetheless the advantage of the transducer over the CTC network on its own is relatively slight. This may be because the TIMIT transcriptions are too small a training set for the prediction network: around 150K labels, as opposed to the millions of words typically used to train language models. This is supported by the poor performance of the standalone prediction network: it misclassifies almost three quarters of the targets, and its per-phoneme loss is not much better than the entropy of the phoneme distribution (4.6 bits). We would therefore hope for a greater improvement on a larger dataset. Alternatively the prediction network could be pretrained on a large 'target-only' dataset, then jointly retrained on the smaller dataset as part of the transducer. The analogous procedure in HMM speech recognisers is to combine language models extracted from large text corpora with acoustic models trained on smaller speech corpora.
3.4. Analysis

One advantage of a differentiable system is that the sensitivity of each component to every other component can be easily calculated. This allows us to analyse the dependency of the output probability lattice on its two sources of information: the input sequence and the previous outputs. Fig. 3 visualises these relationships for an RNN transducer applied to 'end-to-end' speech recognition, where raw spectrogram images are directly transcribed with character sequences, with no intermediate conversion into phonemes.

4. Conclusions and Future Work

We have introduced a generic sequence transducer composed of two recurrent neural networks and demonstrated its ability to integrate acoustic and linguistic information during a speech recognition task.

We are currently training the transducer on large-scale speech and handwriting recognition databases. Some of the illustrations in this paper are drawn from an ongoing experiment in end-to-end speech recognition.

In the future we would like to look at a wider range of sequence transduction problems, particularly those that are difficult to tackle with conventional algorithms such as HMMs. One example would be text-to-speech, where a small number of discrete input labels are transformed into long, continuous output trajectories. Another is machine translation, which is particularly challenging due to the complex alignment between the input and output sequences.

Acknowledgements

Ilya Sutskever, Chris Maddison and Geoffrey Hinton provided helpful discussions and suggestions for this work. Alex Graves is a Junior Fellow of the Canadian Institute for Advanced Research.

References

Bertolami, R., Zimmermann, M., and Bunke, H. Rejection strategies for offline handwritten text line recognition. Pattern Recognition Letters, 27(16):2005-2012, 2006.

Bottou, L., Bengio, Y., and Le Cun, Y. Global training of document processing systems using graph transformer networks. In CVPR, 1997.

Dahl, G., Ranzato, M., Mohamed, A., and Hinton, G. Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS, 2010.

The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT). DARPA-ISTO, speech disc cd1-1.1 edition, 1990.

Gers, F. Long Short-Term Memory in Recurrent Neural Networks. PhD thesis, Ecole Polytechnique Fédérale de Lausanne, 2001.

Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602-610, 2005.

Graves, A. and Schmidhuber, J. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, 2008.

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006.

Graves, A., Fernández, S., Liwicki, M., Bunke, H., and Schmidhuber, J. Unconstrained online handwriting recognition with recurrent neural networks. In NIPS, 2008.

Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kremer, S. C. and Kolen, J. F. (eds.), A Field Guide to Dynamical Recurrent Neural Networks. 2001.

Jim, K.-C., Giles, C., and Horne, B. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424-1438, 1996.

Lafferty, J. D., McCallum, A., and Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lee, K. and Hon, H. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Schuster, M. and Paliwal, K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673-2681, 1997.

Sutskever, I., Martens, J., and Hinton, G. Generating text with recurrent neural networks. In ICML, 2011.

Williams, R. and Zipser, D. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications, pp. 433-486. 1995.

Figure 3. Visualisation of the transducer applied to end-to-end speech recognition. As in Fig. 2, the heat
map in the top right shows the log-probability of the target sequence passing through each point in the output lattice.
The image immediately below that shows the input sequence (a speech spectrogram), and the image immediately to the
left shows the inputs to the prediction network (a series of one-hot binary vectors encoding the target characters). Note
the learned ‘time warping’ between the two sequences. Also note the blue ‘tendrils’, corresponding to low probability
alignments, and the short vertical segments, corresponding to common character sequences (such as ‘TH’ and ‘HER’)
emitted during a single input step.

The bar graphs in the bottom left indicate the labels most strongly predicted by the output distribution (blue),
the transcription function (red) and the prediction function (green) at the point in the output lattice indicated by
the crosshair. In this case the transcription network simultaneously predicts the letters ‘O’, ‘U’ and ‘L’, presumably
because these correspond to the vowel sound in ‘SHOULD’; the prediction network strongly predicts ‘O’; and the output
distribution sums the two to give highest probability to ‘O’.

The heat map below the input sequence shows the sensitivity of the probability at the crosshair to the pixels in
the input sequence; the heat map to the left of the prediction inputs shows the sensitivity of the same point to the
previous outputs. The maps suggest that both networks are sensitive to long range dependencies, with visible effects
extending across the length of input and output sequences. Note the dark horizontal bands in the prediction heat map;
these correspond to a lowered sensitivity to spaces between words. Similarly the transcription network is more sensitive
to parts of the spectrogram with higher energy. The sensitivity of the transcription network extends in both directions
because it is bidirectional, unlike the prediction network.
