
SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS

Alex Graves, Abdel-rahman Mohamed and Geoffrey Hinton

Department of Computer Science, University of Toronto

ABSTRACT

Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

Index Terms— recurrent neural networks, deep neural networks, speech recognition

1. INTRODUCTION

Neural networks have a long history in speech recognition, usually in combination with hidden Markov models [1, 2]. They have gained attention in recent years with the dramatic improvements in acoustic modelling yielded by deep feedforward networks [3, 4]. Given that speech is an inherently dynamic process, it seems natural to consider recurrent neural networks (RNNs) as an alternative model. HMM-RNN systems [5] have also seen a recent revival [6, 7], but do not currently perform as well as deep networks.

Instead of combining RNNs with HMMs, it is possible to train RNNs 'end-to-end' for speech recognition [8, 9, 10]. This approach exploits the larger state-space and richer dynamics of RNNs compared to HMMs, and avoids the problem of using potentially incorrect alignments as training targets. The combination of Long Short-term Memory [11], an RNN architecture with an improved memory, with end-to-end training has proved especially effective for cursive handwriting recognition [12, 13]. However it has so far made little impact on speech recognition.

RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks. To answer this question we introduce deep Long Short-term Memory RNNs and assess their potential for speech recognition. We also present an enhancement to a recently introduced end-to-end learning method that jointly trains two separate RNNs as acoustic and linguistic models [10]. Sections 2 and 3 describe the network architectures and training methods, Section 4 provides experimental results and concluding remarks are given in Section 5.

2. RECURRENT NEURAL NETWORKS

Given an input sequence x = (x_1, ..., x_T), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h_1, ..., h_T) and output vector sequence y = (y_1, ..., y_T) by iterating the following equations from t = 1 to T:

    h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)    (1)
    y_t = W_{hy} h_t + b_y    (2)

where the W terms denote weight matrices (e.g. W_{xh} is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector) and H is the hidden layer function.

H is usually an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [11], which uses purpose-built memory cells to store information, is better at finding and exploiting long range context. Fig. 1 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [14], H is implemented by the following composite function:

    i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)    (3)
    f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)    (4)
    c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (5)
    o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)    (6)
    h_t = o_t tanh(c_t)    (7)

where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrices from the cell to gate vectors (e.g. W_{ci}) are diagonal, so element m in each gate vector only receives input from element m of the cell vector.
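
To make the LSTM recurrence concrete, here is a minimal NumPy sketch of the composite function H in Eqs. (3)-(7), iterated over an input sequence as in Eq. (1). The parameter layout (a dictionary keyed by names such as W_xi and w_ci) and the zero initial state are illustrative assumptions rather than details from the paper; the diagonal cell-to-gate matrices are stored as vectors and applied elementwise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One application of the composite function H in Eqs. (3)-(7).
    `p` is a dict of parameters; w_ci, w_cf, w_co are the diagonal
    cell-to-gate ("peephole") weights, stored here as vectors."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])  # input gate,  Eq. (3)
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])  # forget gate, Eq. (4)
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])      # cell state,  Eq. (5)
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])       # output gate, Eq. (6)
    h = o * np.tanh(c)                                                                  # hidden vector, Eq. (7)
    return h, c

def run_lstm(xs, p, n_hidden):
    """Iterate the LSTM cell over an input sequence, as in Eq. (1)."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        hs.append(h)
    return np.stack(hs)
```
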
One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, where whole utterances are transcribed at once, there is no reason not to exploit future context as well. Bidirectional RNNs (BRNNs) [15] do this by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. As illustrated in Fig. 2, a BRNN computes the forward hidden sequence →h, the backward hidden sequence ←h and the output sequence y by iterating the backward layer from t = T to 1, the forward layer from t = 1 to T and then updating the output layer:

    →h_t = H(W_{x→h} x_t + W_{→h→h} →h_{t-1} + b_{→h})    (8)
    ←h_t = H(W_{x←h} x_t + W_{←h←h} ←h_{t+1} + b_{←h})    (9)
    y_t = W_{→h y} →h_t + W_{←h y} ←h_t + b_y    (10)

Combining BRNNs with LSTM gives bidirectional LSTM [16], which can access long-range context in both input directions.

[Fig. 1. Long Short-term Memory Cell]

[Fig. 2. Bidirectional RNN]

A crucial element of the recent success of hybrid HMM-neural network systems is the use of deep architectures, which are able to build up progressively higher level representations of acoustic data. Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vector sequences h^n are iteratively computed from n = 1 to N and t = 1 to T:

    h^n_t = H(W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)    (11)

where we define h^0 = x. The network outputs y_t are

    y_t = W_{h^N y} h^N_t + b_y    (12)

Deep bidirectional RNNs can be implemented by replacing each hidden sequence h^n with the forward and backward sequences →h^n and ←h^n, and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. If LSTM is used for the hidden layers we get deep bidirectional LSTM, the main architecture used in this paper. As far as we are aware this is the first time deep LSTM has been applied to speech recognition, and we find that it yields a dramatic improvement over single-layer LSTM.
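
The following sketch shows one way the bidirectional layers of Eqs. (8)-(9) can be stacked into a deep bidirectional network, with each level receiving the forward and backward outputs of the level below, as in Eq. (11). The function signatures and the choice to concatenate the two streams (rather than pass them separately) are simplifying assumptions for illustration.

```python
import numpy as np

def birnn_layer(xs, step_fn, params_fwd, params_bwd, n_hidden):
    """One bidirectional level, Eqs. (8)-(9). `step_fn` is any recurrent
    cell with signature (x_t, h_prev, c_prev, params) -> (h, c), e.g. the
    LSTM step sketched above."""
    T = len(xs)
    h_f = np.zeros(n_hidden); c_f = np.zeros(n_hidden)
    h_b = np.zeros(n_hidden); c_b = np.zeros(n_hidden)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                      # forward layer, t = 1..T
        h_f, c_f = step_fn(xs[t], h_f, c_f, params_fwd)
        fwd[t] = h_f
    for t in reversed(range(T)):            # backward layer, t = T..1
        h_b, c_b = step_fn(xs[t], h_b, c_b, params_bwd)
        bwd[t] = h_b
    return fwd, bwd

def deep_birnn(xs, step_fn, levels, n_hidden):
    """Stack N bidirectional levels: each hidden layer receives the
    concatenated forward and backward outputs of the level below.
    `levels` is a list of (params_fwd, params_bwd) pairs, one per level."""
    inputs = xs
    for params_fwd, params_bwd in levels:
        fwd, bwd = birnn_layer(inputs, step_fn, params_fwd, params_bwd, n_hidden)
        inputs = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
    # sequence of concatenated top-level forward/backward vectors,
    # which would be fed to the output layer (cf. Eqs. (12)-(13))
    return inputs
```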

3. NETWORK TRAINING

We focus on end-to-end training, where RNNs learn to map directly from acoustic to phonetic sequences. One advantage of this approach is that it removes the need for a predefined (and error-prone) alignment to create the training targets. The first step is to use the network outputs to parameterise a differentiable distribution Pr(y|x) over all possible phonetic output sequences y given an acoustic input sequence x. The log-probability log Pr(z|x) of the target output sequence z can then be differentiated with respect to the network weights using backpropagation through time [17], and the whole system can be optimised with gradient descent. We now describe two ways to define the output distribution and hence train the network. We refer throughout to the length of x as T, the length of z as U, and the number of possible phonemes as K.

3.1. Connectionist Temporal Classification

The first method, known as Connectionist Temporal Classification (CTC) [8, 9], uses a softmax layer to define a separate output distribution Pr(k|t) at every step t along the input sequence. This distribution covers the K phonemes plus an extra blank symbol ∅ which represents a non-output (the softmax layer is therefore size K + 1). Intuitively the network decides whether to emit any label, or no label, at every timestep. Taken together these decisions define a distribution over alignments between the input and target sequences. CTC then uses a forward-backward algorithm to sum over all possible alignments and determine the normalised probability Pr(z|x) of the target sequence given the input sequence [8]. Similar procedures have been used elsewhere in speech and handwriting recognition to integrate out over possible segmentations [18, 19]; however CTC differs in that it ignores segmentation altogether and sums over single-timestep label decisions instead.

RNNs trained with CTC are generally bidirectional, to ensure that every Pr(k|t) depends on the entire input sequence, and not just the inputs up to t. In this work we focus on deep bidirectional networks, with Pr(k|t) defined as follows:

    y_t = W_{→h^N y} →h^N_t + W_{←h^N y} ←h^N_t + b_y    (13)
    Pr(k|t) = exp(y_t[k]) / Σ_{k'=1}^{K} exp(y_t[k'])    (14)

where y_t[k] is the k-th element of the length K + 1 unnormalised output vector y_t, and N is the number of bidirectional levels.

3.2. RNN Transducer

CTC defines a distribution over phoneme sequences that depends only on the acoustic input sequence x. It is therefore an acoustic-only model. A recent augmentation, known as an RNN transducer [10], combines a CTC-like network with a separate RNN that predicts each phoneme given the previous ones, thereby yielding a jointly trained acoustic and language model. Joint LM-acoustic training has proved beneficial in the past for speech recognition [20, 21].

Whereas CTC determines an output distribution at every input timestep, an RNN transducer determines a separate distribution Pr(k|t, u) for every combination of input timestep t and output timestep u. As with CTC, each distribution covers the K phonemes plus ∅. Intuitively the network 'decides' what to output depending both on where it is in the input sequence and the outputs it has already emitted. For a length U target sequence z, the complete set of T U decisions jointly determines a distribution over all possible alignments between x and z, which can then be integrated out with a forward-backward algorithm to determine log Pr(z|x) [10].

In the original formulation Pr(k|t, u) was defined by taking an 'acoustic' distribution Pr(k|t) from the CTC network and a 'linguistic' distribution Pr(k|u) from the prediction network, then multiplying the two together and renormalising. An improvement introduced in this paper is to instead feed the hidden activations of both networks into a separate feedforward output network, whose outputs are then normalised with a softmax function to yield Pr(k|t, u). This allows a richer set of possibilities for combining linguistic and acoustic information, and appears to lead to better generalisation. In particular we have found that the number of deletion errors encountered during decoding is reduced.

Denote by →h^N and ←h^N the uppermost forward and backward hidden sequences of the CTC network, and by p the hidden sequence of the prediction network. At each t, u the output network is implemented by feeding →h^N_t and ←h^N_t to a linear layer to generate the vector l_t, then feeding l_t and p_u to a tanh hidden layer to yield h_{t,u}, and finally feeding h_{t,u} to a size K + 1 softmax layer to determine Pr(k|t, u):

    l_t = W_{→h^N l} →h^N_t + W_{←h^N l} ←h^N_t + b_l    (15)
    h_{t,u} = tanh(W_{lh} l_t + W_{ph} p_u + b_h)    (16)
    y_{t,u} = W_{hy} h_{t,u} + b_y    (17)
    Pr(k|t, u) = exp(y_{t,u}[k]) / Σ_{k'=1}^{K} exp(y_{t,u}[k'])    (18)

where y_{t,u}[k] is the k-th element of the length K + 1 unnormalised output vector. For simplicity we constrained all non-output layers to be the same size (|→h^n_t| = |←h^n_t| = |p_u| = |l_t| = |h_{t,u}|); however they could be varied independently.

RNN transducers can be trained from random initial weights. However they appear to work better when initialised with the weights of a pretrained CTC network and a pretrained next-step prediction network (so that only the output network starts from random weights). The output layers (and all associated weights) used by the networks during pretraining are removed during retraining. In this work we pretrain the prediction network on the phonetic transcriptions of the audio training data; however for large-scale applications it would make more sense to pretrain on a separate text corpus.
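
As a rough illustration of the joint output network in Eqs. (15)-(18), the sketch below computes Pr(k|t, u) for a single (t, u) pair from the top-level forward and backward CTC hidden vectors and the prediction-network hidden vector. The weight names (W_fl, W_bl, ...) and the dictionary layout are hypothetical; only the sequence of linear, tanh and softmax operations follows the equations.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def transducer_output(hf_t, hb_t, p_u, p):
    """Joint output network of Eqs. (15)-(18) for one (t, u) pair.
    hf_t, hb_t: top-level forward/backward CTC hidden vectors at time t;
    p_u: prediction-network hidden vector at output step u."""
    l_t = p["W_fl"] @ hf_t + p["W_bl"] @ hb_t + p["b_l"]           # linear layer, Eq. (15)
    h_tu = np.tanh(p["W_lh"] @ l_t + p["W_ph"] @ p_u + p["b_h"])   # tanh hidden layer, Eq. (16)
    y_tu = p["W_hy"] @ h_tu + p["b_y"]                             # K+1 unnormalised outputs, Eq. (17)
    return softmax(y_tu)                                           # Pr(k|t, u), Eq. (18)
```
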
3.3. Decoding

RNN transducers can be decoded with beam search [10] to yield an n-best list of candidate transcriptions. In the past CTC networks have been decoded using either a form of best-first decoding known as prefix search, or by simply taking the most active output at every timestep [8]. In this work however we exploit the same beam search as the transducer, with the modification that the output label probabilities Pr(k|t, u) do not depend on the previous outputs (so Pr(k|t, u) = Pr(k|t)). We find beam search both faster and more effective than prefix search for CTC. Note the n-best list from the transducer was originally sorted by the length-normalised log-probability log Pr(y)/|y|; however in the current work we dispense with the normalisation (which only helps when there are many more deletions than insertions) and sort by Pr(y).

3.4. Regularisation

Regularisation is vital for good performance with RNNs, as their flexibility makes them prone to overfitting. Two regularisers were used in this paper: early stopping and weight noise (the addition of Gaussian noise to the network weights during training [22]). Weight noise was added once per training sequence, rather than at every timestep. Weight noise tends to 'simplify' neural networks, in the sense of reducing the amount of information required to transmit the parameters [23, 24], which improves generalisation.
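
A minimal sketch of the weight-noise regulariser described above, assuming the parameters are held as a list of NumPy arrays: the weights are perturbed once per training sequence and the clean copy is kept for the parameter update. The training-loop helpers in the comments are hypothetical.

```python
import numpy as np

def add_weight_noise(weights, sigma=0.075, rng=None):
    """Return a noisy copy of the weights, perturbing each element with
    Gaussian noise once per training sequence (Section 3.4). The default
    sigma = 0.075 is the value used in the experiments of Section 4."""
    rng = np.random.default_rng() if rng is None else rng
    return [w + rng.normal(0.0, sigma, size=w.shape) for w in weights]

# One common arrangement of the per-sequence training loop (sketch only):
# for x, z in training_set:
#     noisy = add_weight_noise(clean_weights)
#     loss, grads = forward_backward(noisy, x, z)        # hypothetical helper
#     clean_weights = sgd_update(clean_weights, grads)   # hypothetical helper
```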

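For completeness, here is an illustrative implementation of the CTC forward recursion of Section 3.1, which sums over all alignments between the input and the blank-augmented target sequence to give log Pr(z|x). It is not the authors' code; a practical version would work in log space or with scaling to avoid underflow, and would also need the backward pass to obtain gradients.

```python
import numpy as np

def ctc_log_prob(probs, target, blank=0):
    """Forward pass of the CTC forward-backward algorithm.
    probs: (T, K+1) array of softmax outputs Pr(k|t); target: sequence z."""
    T = probs.shape[0]
    ext = [blank]
    for k in target:              # blank-augmented label sequence z'
        ext += [k, blank]
    S = len(ext)
    alpha = np.zeros((T, S))      # alpha[t, s] sums alignments ending in z'[s] at time t
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    tail = alpha[T - 1, S - 2] if S > 1 else 0.0
    return np.log(alpha[T - 1, S - 1] + tail)   # log Pr(z|x)
```
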
4. EXPERIMENTS

Phoneme recognition experiments were performed on the TIMIT corpus [25]. The standard 462 speaker set with all SA records removed was used for training, and a separate development set of 50 speakers was used for early stopping. Results are reported for the 24-speaker core test set. The audio data was encoded using a Fourier-transform-based filter-bank with 40 coefficients (plus energy) distributed on a mel-scale, together with their first and second temporal derivatives. Each input vector was therefore size 123. The data were normalised so that every element of the input vectors had zero mean and unit variance over the training set. All 61 phoneme labels were used during training and decoding (so K = 61), then mapped to 39 classes for scoring [26]. Note that all experiments were run only once, so the variance due to random weight initialisation and weight noise is unknown.

As shown in Table 1, nine RNNs were evaluated, varying along three main dimensions: the training method used (CTC, Transducer or pretrained Transducer), the number of hidden levels (1-5), and the number of LSTM cells in each hidden layer. Bidirectional LSTM was used for all networks except CTC-3l-500h-tanh, which had tanh units instead of LSTM cells, and CTC-3l-421h-uni, where the LSTM layers were unidirectional. The four networks CTC-3l-500h-tanh, CTC-1l-622h, CTC-3l-421h-uni and CTC-3l-250h all had approximately the same number of weights, but give radically different results. All networks were trained using stochastic gradient descent, with learning rate 10^-4, momentum 0.9 and random initial weights drawn uniformly from [-0.1, 0.1].

All networks except CTC-3l-500h-tanh and PreTrans-3l-250h were first trained with no noise and then, starting from the point of highest log-probability on the development set, retrained with Gaussian weight noise (σ = 0.075) until the point of lowest phoneme error rate on the development set. PreTrans-3l-250h was initialised with the weights of CTC-3l-250h, along with the weights of a phoneme prediction network (which also had a hidden layer of 250 LSTM cells), both of which were trained without noise, retrained with noise, and stopped at the point of highest log-probability. PreTrans-3l-250h was trained from this point with noise added. CTC-3l-500h-tanh was entirely trained without weight noise because it failed to learn with noise added. Beam search decoding was used for all networks, with a beam width of 100.

Table 1. TIMIT Phoneme Recognition Results. 'Epochs' is the number of passes through the training set before convergence. 'PER' is the phoneme error rate on the core test set.

    NETWORK            WEIGHTS   EPOCHS   PER
    CTC-3l-500h-tanh   3.7M      107      37.6%
    CTC-1l-250h        0.8M      82       23.9%
    CTC-1l-622h        3.8M      87       23.0%
    CTC-2l-250h        2.3M      55       21.0%
    CTC-3l-421h-uni    3.8M      115      19.6%
    CTC-3l-250h        3.8M      124      18.6%
    CTC-5l-250h        6.8M      150      18.4%
    Trans-3l-250h      4.3M      112      18.3%
    PreTrans-3l-250h   4.3M      144      17.7%

The advantage of deep networks is immediately obvious, with the error rate for CTC dropping from 23.9% to 18.4% as the number of hidden levels increases from one to five. The three main conclusions we can draw from this are (a) LSTM works much better than tanh for this task, (b) bidirectional LSTM has a slight advantage over unidirectional LSTM and (c) depth is more important than layer size (which supports previous findings for deep networks [3]). Although the advantage of the transducer is slight when the weights are randomly initialised, it becomes more substantial when pretraining is used.

[Fig. 3. Input Sensitivity of a deep CTC RNN. The heatmap (top) shows the derivatives of the 'ah' and 'p' outputs printed in red with respect to the filterbank inputs (bottom). The TIMIT ground truth segmentation is shown below. Note that the sensitivity extends to surrounding segments; this may be because CTC (which lacks an explicit language model) attempts to learn linguistic dependencies from the acoustic data.]

5. CONCLUSIONS AND FUTURE WORK

We have shown that the combination of deep, bidirectional Long Short-term Memory RNNs with end-to-end training and weight noise gives state-of-the-art results in phoneme recognition on the TIMIT database. An obvious next step is to extend the system to large vocabulary speech recognition. Another interesting direction would be to combine frequency-domain convolutional neural networks [27] with deep LSTM.
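
As an illustration of the input pipeline described above, the sketch below appends first and second temporal derivatives to 41-dimensional filterbank-plus-energy frames (giving size-123 vectors) and standardises each dimension with training-set statistics. The derivative estimator (np.gradient) and the per-utterance list layout are assumptions; the paper does not specify its exact delta computation.

```python
import numpy as np

def add_deltas(feats):
    """Append first and second temporal derivatives to a (T, 41) array of
    filterbank + energy features, giving (T, 123) input vectors."""
    d1 = np.gradient(feats, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.concatenate([feats, d1, d2], axis=1)

def normalise(train_utts, utts):
    """Standardise every input dimension to zero mean and unit variance,
    with statistics estimated on the training set only."""
    stacked = np.concatenate(train_utts, axis=0)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8
    return [(u - mean) / std for u in utts]
```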

6. REFERENCES

[1] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.

[2] Qifeng Zhu, Barry Chen, Nelson Morgan, and Andreas Stolcke, "Tandem connectionist feature extraction for conversational speech recognition," in International Conference on Machine Learning for Multimodal Interaction, MLMI'04, Berlin, Heidelberg, 2005, pp. 223-231, Springer-Verlag.

[3] A. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14-22, Jan. 2012.

[4] G. Hinton, Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, Nov. 2012.

[5] A. J. Robinson, "An Application of Recurrent Nets to Phone Probability Estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298-305, 1994.

[6] Oriol Vinyals, Suman Ravuri, and Daniel Povey, "Revisiting Recurrent Neural Networks for Robust ASR," in ICASSP, 2012.

[7] A. Maas, Q. Le, T. O'Neil, O. Vinyals, P. Nguyen, and A. Ng, "Recurrent neural networks for noise reduction in robust ASR," in INTERSPEECH, 2012.

[8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in ICML, Pittsburgh, USA, 2006.

[9] A. Graves, Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.

[10] A. Graves, "Sequence transduction with recurrent neural networks," in ICML Representation Learning Workshop, 2012.

[11] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[12] A. Graves, M. Liwicki, S. Fernández, H. Bunke, and J. Schmidhuber, "Unconstrained Online Handwriting Recognition with Recurrent Neural Networks," in NIPS, 2008.

[13] Alex Graves and Juergen Schmidhuber, "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks," in NIPS, 2009.

[14] F. Gers, N. Schraudolph, and J. Schmidhuber, "Learning Precise Timing with LSTM Recurrent Networks," Journal of Machine Learning Research, vol. 3, pp. 115-143, 2002.

[15] M. Schuster and K. K. Paliwal, "Bidirectional Recurrent Neural Networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673-2681, 1997.

[16] A. Graves and J. Schmidhuber, "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures," Neural Networks, vol. 18, no. 5-6, pp. 602-610, June/July 2005.

[17] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," pp. 696-699, MIT Press, 1988.

[18] Geoffrey Zweig and Patrick Nguyen, "SCARF: A segmental CRF speech recognition system," Tech. Rep., Microsoft Research, 2009.

[19] Andrew W. Senior and Anthony J. Robinson, "Forward-backward retraining of recurrent neural networks," in NIPS, 1996, pp. 743-749.

[20] Abdel-rahman Mohamed, Dong Yu, and Li Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in Interspeech, 2010.

[21] M. Lehr and I. Shafran, "Discriminatively estimated joint acoustic, duration, and language model for speech recognition," in ICASSP, 2010, pp. 5542-5545.

[22] Kam-Chuen Jim, C. L. Giles, and B. G. Horne, "An analysis of noise in recurrent neural networks: convergence and generalization," IEEE Transactions on Neural Networks, vol. 7, no. 6, pp. 1424-1438, Nov. 1996.

[23] Geoffrey E. Hinton and Drew van Camp, "Keeping the neural networks simple by minimizing the description length of the weights," in COLT, 1993, pp. 5-13.

[24] Alex Graves, "Practical variational inference for neural networks," in NIPS, 2011, pp. 2348-2356.

[25] DARPA-ISTO, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), speech disc cd1-1.1 edition, 1990.

[26] Kai-fu Lee and Hsiao-wuen Hon, "Speaker-independent phone recognition using hidden markov models," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.

[27] O. Abdel-Hamid, A. Mohamed, Hui Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in ICASSP, Mar. 2012, pp. 4277-4280.