Abstract
Deep neural networks (DNNs) have achieved great success in acoustic modeling for speech recognition. However, DNNs with sigmoid neurons
may suffer from the vanishing gradient problem during training. Maxout neurons are promising alternatives to sigmoid neurons. The activation
of a maxout neuron is obtained by selecting the maximum value within a local region, which results in constant gradients during the training
process. In this paper, we combine the maxout neurons with two popular DNN structures for acoustic modeling, namely the convolutional neural
network (CNN) and the long short-term memory (LSTM) recurrent neural network (RNN). The optimal network structures and training strategies
for the models are explored. Experiments are conducted on the benchmark data sets released under the IARPA Babel Program. The proposed
models achieve 2.5–6.0% relative improvements over their corresponding CNN or LSTM RNN baselines across six language collections. The
state-of-the-art results on these data sets are achieved after system combination.
© 2015 Elsevier B.V. All rights reserved.
Keywords: Maxout neuron; Convolutional neural network; Long short-term memory; Acoustic modeling; Speech recognition.
… modeling the local structures of the input data. There are three key features of CNNs. First, the lower layers of the CNN are split into a set of local receptive fields, so that different spatial regions of the input data can be processed differently. Second, the weights connecting the receptive fields in adjacent layers are often replicated for robust parameter estimation. Third, pooling layers are often used in conjunction with the convolutional layers to reduce the signal variations. The CNN has been shown to outperform DNNs in acoustic modeling, both in small vocabulary tasks (Abdel-Hamid et al., 2012; Toth, 2013) and large vocabulary tasks (Sainath et al., 2013a, 2013b).

The second aspect of exploiting prior knowledge of speech signals is utilizing long-time dependencies among the speech frames. Feed-forward neural networks discover cross-frame information by concatenating input features in a context window. However, the length of the context window has to be fixed. Some works use recurrent neural networks (RNNs) (Vinyals et al., 2012; Weng et al., 2014; Saon et al., 2014) to capture longer context and report better performance than feed-forward DNNs, especially for noise-robust tasks. But the time dependencies learned by standard RNNs are still limited due to the vanishing and exploding gradient problem (Bengio et al., 1994). Recent innovations in acoustic modeling explore the use of long short-term memory (LSTM) RNNs (Gers et al., 2002). The recurrent cells in LSTM RNNs are linearly connected in order to solve the vanishing gradient problem. The input signals, recurrent signals and output signals in LSTM RNNs are controlled by gate signals to achieve precise timing. LSTM RNN-based systems achieved state-of-the-art results on the TIMIT phone recognition task in 2013 (Graves et al., 2013a). Then LSTM RNN-HMM hybrid models were applied to large vocabulary tasks (Graves et al., 2013b). After that, researchers at Google extended the LSTM RNN acoustic model to big data and achieved impressive results on voice search tasks (Sak et al., 2014a, 2014b). The good performances of LSTM RNNs on various tasks demonstrate their ability to learn long-time dependencies from data.

To solve the vanishing gradient problem of the sigmoid neurons, previous solutions mostly fall into two categories. The first is to use pre-training methods for the sigmoid neurons. Existing pre-training algorithms include deep belief network (DBN) pre-training (Hinton et al., 2006), denoising autoencoders (Vincent et al., 2008), discriminative pre-training (Seide et al., 2011), etc. These methods aim at finding a good initialization for DNNs, so that the final model converges better with SGD training. However, pre-training methods introduce extra computational cost, and they are not directly applicable to neural network architectures other than feed-forward DNNs, e.g. CNNs or LSTM RNNs. The other solution to the vanishing gradient problem is to use neuron types that have constant gradients. The rectified linear unit (ReLU) (Nair and Hinton, 2010; Glorot et al., 2011) is one kind of neuron that produces constant gradients. The activation function of the ReLU has the form y = max(x, 0), so its gradient is either 1 or 0, depending on the input value. Later, the maxout neuron was proposed for machine learning tasks (Goodfellow et al., 2013). This neuron type achieves its nonlinearity by selecting the maximum value within a group of linear pieces, so its gradient is 1 if the piece has the maximum value and 0 otherwise. The maxout network has obtained state-of-the-art results in many data classification tasks (Goodfellow et al., 2013). It has also shown good results in acoustic modeling for speech recognition (Cai et al., 2013; Miao et al., 2013).

In this paper, we aim at building DNN structures for acoustic modeling that make use of prior knowledge in the speech signals and at the same time solve the vanishing gradient problem. The solution we propose is to combine the maxout neurons with both the CNN and the LSTM RNN. The convolutional structure and the LSTM recurrent structure of the networks capture information within and among the speech frames, while the maxout neurons solve the vanishing gradient problem. The resulting models are referred to as the convolutional maxout neural network (CMNN) and the recurrent maxout neural network (RMNN). This work greatly extends our previous works (Cai et al., 2013, 2014) in four aspects. First, the RMNN model is proposed to capture long-term dependencies in speech, which has not been attempted previously. Second, we try various alternative network structures and training strategies to explore the optimal configuration. Third, extensive experiments are carried out on the Babel benchmark data sets, covering six languages. Fourth, the system combination results for the CMNNs and the RMNNs are presented, which achieve state-of-the-art performances on these data sets.

The remainder of this paper is organized as follows. In Section 2, a description of the baseline CNN and LSTM RNN is given. In Section 3, we introduce the CMNN and the RMNN acoustic models. In Section 4, we give our experimental details and results. Finally, the conclusions are presented in Section 5.

2. Baseline CNN and LSTM RNN

In this work, the CNN acoustic models and the LSTM RNN acoustic models are employed as the baseline systems.

2.1. CNN acoustic model

The CNN aims at discovering local structures in the input data. The speech signal exhibits local similarities in both the spectral and the temporal dimension, making the application of CNN acoustic models possible. An example of the CNN applied to speech signals is illustrated in Fig. 1. In the figure, the light and shaded regions along the frequency axis represent the speech formants. The CNN has a set of locally-connected weights in the lower convolutional layers to model different regions in the speech spectrogram.

To apply the CNN acoustic model to speech signals, the spectral structure of the speech features has to be preserved. The Mel filterbanks or the power spectrum features (Sainath et al., 2013) are good choices for the CNN. The input features are first divided into N non-overlapping frequency bands {v_i | i ∈ 0, ..., N−1}. A group of bands {v_{i+r} | r ∈ 0, ..., s−1} produces the activations h_i through the weight connections and nonlinear transforms. If full weight-sharing is used, then h_i is
computed as

    h_i = θ( Σ_{r=0}^{s−1} W_r^T · v_{i+r} + b ),   (1)

where {W_r | r ∈ 0, ..., s−1} and b are the weights and biases replicated in the convolutional layer, and θ is the nonlinear transform. The s is called the band width, and r is the index of the bands. The weights and biases shift d bands every time, producing a group of activation bands {h_j | j ∈ 0, ..., M−1}. The d is called the band shift, which is typically 1. The meanings of the band width s and the band shift d are also illustrated in Fig. 1.

The max-pooling layer is often used after the convolutional layer to reduce the variance of frequency shift. The form of the max-pooling is

    p_i^m = max_{j ∈ {i·k, ..., (i+1)·k−1}} h_j^m,   (2)

where p_i^m is the output of the max-pooling operation, i and j are the indexes of the bands, m is the index of the neurons in a band, and k is the pooling size.

After several layers of convolution or max-pooling, the bands are concatenated and fully-connected weights are applied just like in conventional DNNs.
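To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of one convolutional layer with full weight sharing followed by max-pooling. The band width s, band shift d, pooling size k, array shapes and the sigmoid choice for θ are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def conv_bands(v, W, b, s, d):
    """Eq. (1): h_i = theta(sum_r W_r^T v_{i+r} + b), full weight sharing.

    v: (N, dim) input split into N frequency bands of size dim
    W: (s, dim, out) replicated weights, one matrix per band offset r
    b: (out,) replicated bias
    """
    N = v.shape[0]
    h = []
    for i in range(0, N - s + 1, d):          # shift the filter by d bands
        z = sum(W[r].T @ v[i + r] for r in range(s)) + b
        h.append(1.0 / (1.0 + np.exp(-z)))    # theta: sigmoid (assumed)
    return np.stack(h)                        # (M, out) activation bands

def max_pool(h, k):
    """Eq. (2): p_i^m = max over bands j in [i*k, (i+1)*k) of h_j^m."""
    M, out = h.shape
    return h[: (M // k) * k].reshape(M // k, k, out).max(axis=1)

# Toy usage: 40 Mel bands, band width s=8, shift d=1, pooling k=3 (assumed).
v = np.random.randn(40, 1)
W = np.random.randn(8, 1, 64) * 0.01
p = max_pool(conv_bands(v, W, np.zeros(64), s=8, d=1), k=3)
```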
2.2. LSTM RNN acoustic model

RNNs have the advantage of modeling longer context than DNNs. The speech signals are non-stationary time series, and the context information is important for speech recognition. Conventional RNNs have cyclic connections in the hidden layers to model temporal correlations. But modeling long-time dependencies using conventional RNNs is difficult due to the vanishing and exploding gradients in the SGD training (Bengio et al., 1994). The LSTM RNN is an elegant solution for modeling long-time dependencies.

The architecture of the LSTM RNN is depicted in Fig. 2. The major difference of the LSTM RNN from the conventional RNN is that the recurrent connections are linear in the LSTM RNN. This connection is controlled by a signal f_t called the forget gate. Because the recurrent connections in the LSTM RNN do not go through nonlinear transforms, the gradients will be back-propagated smoothly through the recurrent connections during SGD. In this way the model can learn long-time dependencies.

[Fig. 2. An illustration of the LSTM memory block. The red lines stand for the connections with weights. The dashed lines stand for recurrent connections. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)]

In addition to the forget gate, there are also the input gate i_t and the output gate o_t in the LSTM RNN to control the input and output signals. Let x_t and y_t be the input and output signals for an LSTM network at time t. Let c_t and m_t be the cell activation and the output activation, respectively. The forward pass from x_t to y_t follows the equations according to Gers et al. (2002), Graves et al. (2013a) and Sak et al. (2014a):

    i_t = σ(W_ix x_t + W_im m_{t−1} + W_ic c_{t−1} + b_i)            (3)
    f_t = σ(W_fx x_t + W_fm m_{t−1} + W_fc c_{t−1} + b_f)            (4)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_cx x_t + W_cm m_{t−1} + b_c)     (5)
    o_t = σ(W_ox x_t + W_om m_{t−1} + W_oc c_t + b_o)                (6)
    m_t = o_t ⊙ h(c_t)                                               (7)
    y_t = φ(W_ym m_t + b_y)                                          (8)

where W_cx, W_ix, W_fx and W_ox are the weights connected to the LSTM inputs, W_cm, W_im, W_fm and W_om are the weights connected to the LSTM activations, b_c, b_i, b_f and b_o are the biases, and W_ic, W_fc and W_oc are diagonal peephole connections from the cell to the gate signals. W_ym and b_y are the weights and biases in the final output layer. The σ, g and h are nonlinear functions: the sigmoid function is typically used for σ, and the tanh function is typically used for g and h. The ⊙ is the element-wise product of vectors, and φ is the softmax function in the output layer.

The standard LSTM RNN has one hidden layer. Later, deep LSTM RNNs were shown to outperform the standard LSTM RNN for speech recognition (Graves et al., 2013a; Sak et al., 2014a).
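As a minimal illustration of Eqs. (3)-(7), the sketch below computes one forward step of a peephole LSTM layer in NumPy; Eq. (8) would add the softmax output layer on top. The layer sizes, initialization and the 5-frame toy sequence are assumptions for illustration, not the configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, m_prev, c_prev, p):
    """One forward step of a peephole LSTM, Eqs. (3)-(7).

    p holds the weights: W*x (input), W*m (recurrent), w*c (diagonal
    peepholes, stored as vectors) and the biases b*.
    """
    i = sigmoid(p["Wix"] @ x + p["Wim"] @ m_prev + p["wic"] * c_prev + p["bi"])  # (3)
    f = sigmoid(p["Wfx"] @ x + p["Wfm"] @ m_prev + p["wfc"] * c_prev + p["bf"])  # (4)
    c = f * c_prev + i * np.tanh(p["Wcx"] @ x + p["Wcm"] @ m_prev + p["bc"])     # (5)
    o = sigmoid(p["Wox"] @ x + p["Wom"] @ m_prev + p["woc"] * c + p["bo"])       # (6)
    m = o * np.tanh(c)                                                           # (7)
    return m, c

# Toy usage with assumed sizes: 40-dim input frames, 64 cells.
n_in, n_c = 40, 64
rng = np.random.default_rng(0)
p = {k: rng.normal(0, 0.1, (n_c, n_in)) for k in ("Wix", "Wfx", "Wcx", "Wox")}
p.update({k: rng.normal(0, 0.1, (n_c, n_c)) for k in ("Wim", "Wfm", "Wcm", "Wom")})
p.update({k: np.zeros(n_c) for k in ("wic", "wfc", "woc", "bi", "bf", "bc", "bo")})
m, c = np.zeros(n_c), np.zeros(n_c)
for x in rng.normal(size=(5, n_in)):   # a 5-frame toy sequence
    m, c = lstm_step(x, m, c, p)
```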
In the SGD training stage, the gradient for the maxout neuron
is computed as
    ∂h_i^l / ∂z_ij^l = { 1  if z_ij^l ≥ z_is^l, ∀s ∈ 1, ..., k
                       { 0  otherwise                               (11)
This equation shows that the gradient of the maxout neuron is 1
for the piece with the maximum activation, and 0 otherwise. By
producing constant gradients during SGD training, the vanishing
gradient problem is naturally solved in the maxout networks.
Therefore the optimization of deep maxout neural networks is
easier compared to conventional sigmoid neural networks.
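A small sketch of the maxout nonlinearity and the gradient of Eq. (11): the forward pass keeps the maximum of k linear pieces, and backpropagation routes a constant gradient of 1 to the winning piece and 0 to the rest. The group size k = 2 and the toy values are just an example.

```python
import numpy as np

def maxout_forward(z):
    """z: (neurons, k) linear pieces z_ij; returns h_i = max_j z_ij."""
    return z.max(axis=1)

def maxout_backward(z, grad_h):
    """Eq. (11): dh_i/dz_ij is 1 for the maximal piece, 0 otherwise.

    Ties get 1 for every maximal piece, matching the >= in Eq. (11).
    """
    mask = (z == z.max(axis=1, keepdims=True)).astype(z.dtype)
    return mask * grad_h[:, None]   # upstream gradient passes through unchanged

z = np.array([[0.3, -1.2], [2.0, 2.5]])   # 2 neurons, k = 2 pieces each
h = maxout_forward(z)                     # -> [0.3, 2.5]
dz = maxout_backward(z, np.ones(2))       # -> [[1, 0], [0, 1]]
```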
Table 1. The corpus size and characteristics of the languages. [table omitted]

… the results published in various other works (Cui et al., 2013; Karafiat et al., 2013; Tsakalidis et al., 2014; Chen et al., 2014).

Table 4. TERs (%) of different dropout strategies for the CNN models and the CMNN models for Vietnamese. [table omitted]

Fig. 6. A performance comparison of different acoustic models applied to all the languages. [figure omitted]
Table 7. A comparison of RNNs with different structures. The TERs (%) are for Vietnamese.

Model   Structure                TER
LSTM    3 LSTMP                  47.8
RMNN    1 LSTMP + 5 maxout fc    46.9
RMNN    2 LSTMP + 4 maxout fc    46.0
DRNN    1 LSTMP + 2 sigmoid fc   50.5
DRNN    2 LSTMP + 2 sigmoid fc   50.0
DRNN    1 LSTMP + 5 ReLU fc      47.8

Table 8. TERs (%) of the LSTM, the RMNN and the DRNN models for all the languages.

Model   Cantonese   Pashto   Turkish   Tagalog   Vietnamese   Tamil
LSTM    40.7        50.5     47.4      47.9      47.8         65.0
RMNN    39.0        48.1     44.9      45.7      46.0         63.4
DRNN    45.2        51.3     49.5      50.8      50.0         68.1
… sigmoid layers. The second model has 2 LSTMP recurrent layers and 2 sigmoid layers. The LSTMP layers have 1024 cells and a linear projection of 768 dimensions. The sigmoid layers have 1500 neurons. The hyperparameters of the BPTT training for these two models are the same as those for the LSTMP RNN, except for an initial learning rate of 0.000625 per frame. The third model has 1 LSTMP recurrent layer and 5 ReLU layers. The LSTMP layer has 1024 cells and a linear projection of 512 dimensions. The ReLU layers have 1500 neurons. The hyperparameters of the BPTT training for this model are the same as those for the RMNN.
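The LSTMP layers above are LSTM layers with a linear recurrent projection in the sense of Sak et al. (2014a): the output activation m_t is projected to a lower dimension before it is fed back, which shrinks the four recurrent weight matrices. The sketch below gives a rough feel for the savings with the 1024-cell/768-projection configuration quoted above; the counting formula is our own back-of-the-envelope for a peephole LSTM, and the 40-dimensional input is an assumption, not the paper's feature size.

```python
def lstm_params(n_c, n_i, n_p=None):
    """Approximate weight count of one peephole LSTM layer.

    n_c: memory cells, n_i: input size, n_p: recurrent projection size
    (None means a plain LSTM that recurs over the full cell output).
    """
    n_r = n_p if n_p is not None else n_c
    n = 4 * n_c * (n_i + n_r)   # input + recurrent weights of the 4 units
    n += 3 * n_c + 4 * n_c      # diagonal peepholes and biases
    if n_p is not None:
        n += n_c * n_p          # the linear projection m_t -> r_t
    return n

# 1024 cells, 768-dim projection, 40-dim input frames (input size assumed).
print(lstm_params(1024, 40))       # plain LSTM
print(lstm_params(1024, 40, 768))  # LSTMP: smaller recurrent matrices
```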
The results of the RNN acoustic models are presented in Table 7. The results show that the TER of the LSTM model is 47.8%, which is a very strong baseline. Both the RMNNs with 1 or 2 recurrent hidden layers achieve better results than the LSTM. The results also show that the RMNN with 2 recurrent hidden layers has the best performance, achieving a TER of 46.0%, which is relatively 3.8% better than the LSTM. However, the DRNNs with fully-connected sigmoid neurons do not achieve gains over the baseline, and the performance of the DRNN with fully-connected ReLU neurons is only on par with the baseline LSTMP RNN. These results indicate that it is not trivial to build deep RNNs. The better optimization behavior of the maxout neurons and the better model averaging with dropout training are beneficial for deep models.
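Dropout, credited above with better model averaging, randomly zeroes hidden units during training with rate r. A minimal sketch of the inverted-dropout variant follows; the rescaling convention is an assumption, as the paper does not state which variant it uses, and r = 0.1 simply mirrors the Cantonese setting mentioned below.

```python
import numpy as np

def dropout(h, r, rng, train=True):
    """Inverted dropout: drop each unit with probability r during training.

    Scaling by 1/(1-r) keeps the expected activation unchanged, so no
    extra rescaling is needed at test time.
    """
    if not train or r == 0.0:
        return h
    mask = rng.random(h.shape) >= r
    return h * mask / (1.0 - r)

rng = np.random.default_rng(0)
h = np.ones((4, 8))            # toy fully-connected activations
print(dropout(h, r=0.1, rng=rng))
```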
4.4.2. Results for all the languages

The RMNN models give good results for Vietnamese. It is more convincing to see how the RMNN models perform for the other languages. We conduct experiments for all the 6 languages using the best network structure explored before, i.e., the RMNN models with 2 recurrent hidden layers and 4 maxout fully-connected layers. The standard LSTM models with 3 recurrent layers and the DRNN models with 2 recurrent layers and 2 sigmoid fully-connected layers are also trained. The hyperparameters for the BPTT are the same as those in the previous experiments, except for Cantonese, for which we use a dropout rate r of 0.1 as there is more data. The results are given in Table 8.

The results show that the RMNN models achieve good performances for the different languages. The token error rates of the RMNN models are 2.5–5.3% relatively lower than those of the LSTM models. The DRNNs, however, only show degraded performances compared to the LSTM models. The experiments for all the languages demonstrate the effectiveness of the RMNN models. We also give the performance comparison of the different acoustic models in Fig. 6. The figure clearly shows that the proposed RMNN acoustic models achieve the best performance compared to the other models for all the languages, outperforming the baseline DNNs by 4.9–13.4% relatively.

4.5. Training time

Apart from the token error rates, another matter to take into consideration is the training time. We use highly efficient GPU implementations for all the models in our experiments. The training time per epoch and the number of required epochs for all the models are given in Tables 9 and 10, respectively. The experiments are carried out on NVIDIA Tesla K20m GPUs.

From the training time presented in Tables 9 and 10, we can see that the baseline DNN models cost the least training time. The training time per epoch for the CNN models is 3 to 4 times that of the DNNs. This is because the local weight connections in the convolutional structures require many …
Table 9. Training time per epoch (hours) of different acoustic models for all the languages.

Model          Cantonese   Pashto   Turkish   Tagalog   Vietnamese   Tamil
DNN            2.35        1.32     1.12      1.43      1.22         1.04
CNN            7.07        4.61     3.83      4.98      4.22         3.70
CNN, dropout   7.13        4.68     3.90      5.04      4.37         3.79
CMNN           7.09        4.62     4.05      4.87      4.30         3.13
LSTM           5.33        3.17     3.07      3.44      3.22         2.93
RMNN           4.78        2.93     2.81      3.11      3.00         2.19
Table 10. Number of required training epochs of different acoustic models for all the languages. [table omitted]

Table 11. TERs (%) for system combination. The optimal results are shown in bold face. [table omitted]
… to the CNN. The proposed CMNN models and RMNN models are also combined. The results are better than the previous combinations for Pashto, Turkish, Tagalog and Tamil, but a little worse than the LSTM+RMNN results for Cantonese and Vietnamese. We believe this is because both Cantonese and Vietnamese are tonal languages. The pitch in tonal languages is relatively stable, so the speech in tonal languages exhibits longer frame dependencies than the speech in non-tonal languages. The recurrent structures in the LSTMP RNNs and the RMNNs are suitable for modeling long frame dependencies.

Then we combine the three baseline systems, i.e., the DNN, the CNN and the LSTM models. We also combine the DNNs with the CMNNs and RMNNs for comparison. The results show that the latter combinations achieve relative improvements of 2.2% to 3.3% over the former combinations. These results again confirm the effectiveness of the proposed models compared to the baselines. We further try to combine 4 systems and 5 systems. The best results are shown in bold face in Table 11. To the authors' knowledge, these are the state-of-the-art results on the data sets.
5. Conclusions

In this paper, we extend our previous work on the convolutional maxout neural network (CMNN) acoustic model and propose the new recurrent maxout neural network (RMNN) acoustic model. Both models take advantage of the maxout neurons, while they model the prior knowledge of the speech signals in different aspects. We compare alternative model structures and explore various dropout strategies for the CMNNs and the RMNNs. The performances of the models are evaluated for 6 different languages on the IARPA Babel data sets. System combinations are carried out using different models. Our main discoveries include: (1) Dropout training works well for both the CMNN and the RMNN. The optimal strategy is to apply dropout to all the fully-connected layers of the models. (2) Replacing the maxout neurons with sigmoid neurons in the convolutional part of the CMNNs or the fully-connected part of the RMNNs only results in degraded performance. (3) The relative TER reductions of the CMNNs over the CNNs are between 4.3% and 6.0%, and the relative TER reductions of the RMNNs over the LSTM RNNs are between 2.5% and 5.3%. (4) The training time per epoch for the CMNN is about 3–4 times that of the DNN of the same size, and the training time per epoch for the RMNN is about 2–3 times that of the DNN of the same size. (5) System combinations using the CNN, the LSTM, the CMNN and the RMNN models give state-of-the-art results for all the 6 languages. In the future we want to do multilingual training (Knill et al., 2013). We would also like to try more sophisticated CNN structures (Abdel-Hamid et al., 2014; Sainath et al., 2015) and to design new models that combine the advantages of both the CMNN models and the RMNN models.

Acknowledgments

The authors would like to thank the anonymous reviewers for improving the paper. This work is supported by the National Natural Science Foundation of China under Grant Nos. 61273268, 61370034, 61403224 and 61005017.

References

Abdel-Hamid, O., Mohamed, A., Jiang, H., Penn, G., 2012. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Kyoto, Japan, pp. 4277–4280.
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., Yu, D., 2014. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22 (10), 1533–1545.
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5 (2), 157–166.
Cai, M., Shi, Y., Liu, J., 2013. Deep maxout neural networks for speech recognition. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 291–296.
Cai, M., Shi, Y., Kang, J., Liu, J., Su, T., 2014. Convolutional maxout neural networks for low-resource speech recognition. In: Proceedings of International Symposium on Chinese Spoken Language Processing. ISCSLP. Singapore, pp. 133–137.
Chen, N.F., Sivadas, S., Lim, B.P., Ngo, H.G., Xu, H., Pham, V.T., Ma, B., Li, H., 2014. Strategies for Vietnamese keyword search. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 4121–4125.
Cui, J., Cui, X., Ramabhadran, B., Kim, J., Kingsbury, B., Mamou, J., Mangu, L., Picheny, M., Sainath, T.N., Sethy, A., 2013. Developing speech recognition systems for corpus indexing under the IARPA Babel program. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 6753–6757.
Dahl, G.E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20 (1), 30–42.
Deng, L., Huang, X., 2004. Challenges in adopting speech recognition. Commun. ACM 47 (1), 69–75.
Gers, F.A., Schraudolph, N.N., Schmidhuber, J., 2002. Learning precise timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 115–143.
Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., Khudanpur, S., 2014. A pitch extraction algorithm tuned for automatic speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 2494–2498.
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256.
Glorot, X., Bordes, A., Bengio, Y., 2011. Deep sparse rectifier neural networks. J. Mach. Learn. Res. 15, 315–323.
Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y., 2013. Maxout networks. In: Proceedings of International Conference on Machine Learning. ICML. Atlanta, USA, pp. 2356–2364.
Graves, A., Mohamed, A., Hinton, G.E., 2013a. Speech recognition with deep recurrent neural networks. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 6645–6649.
Graves, A., Jaitly, N., Mohamed, A., 2013b. Hybrid speech recognition with deep bidirectional LSTM. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 273–278.
Hinton, G.E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18 (7), 1527–1554.
Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B., 2012. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29 (6), 82–97.
Karafiat, M., Grezl, F., Hannemann, M., Vesely, K., Cernocky, J.H., 2013. BUT Babel system for spontaneous Cantonese. In: Proceedings of Interspeech. Lyon, France, pp. 2589–2593.
Knill, K.M., Gales, M.J.F., Rath, S.P., Woodland, P.C., Zhang, C., Zhang, S.-X., 2013. Investigation of multilingual deep neural networks for spoken term detection. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 138–143.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
Li, J., Wang, X., Xu, B., 2013. Understanding the dropout strategy and analyzing its effectiveness on LVCSR. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 7614–7618.
Miao, Y., Metze, F., Rawat, S., 2013. Deep maxout networks for low-resource speech recognition. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 398–403.
Miao, Y., Metze, F., 2014. Improving language-universal feature extraction with deep maxout and convolutional neural networks. In: Proceedings of Interspeech. Singapore, pp. 800–804.
Mohamed, A., Dahl, G.E., Hinton, G.E., 2012. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20 (1), 14–22.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of International Conference on Machine Learning. ICML. Haifa, Israel, pp. 807–814.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The Kaldi speech recognition toolkit. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Waikoloa, USA.
Sainath, T.N., Kingsbury, B., Mohamed, A., Ramabhadran, B., 2013. Learning filter banks within a deep neural network framework. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 297–302.
Sainath, T.N., Mohamed, A., Kingsbury, B., Ramabhadran, B., 2013a. Deep convolutional neural networks for LVCSR. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 8614–8618.
Sainath, T.N., Kingsbury, B., Mohamed, A., Dahl, G.E., Saon, G., Soltau, H., Beran, T., Aravkin, A., Ramabhadran, B., 2013b. Improvements to deep convolutional neural networks for LVCSR. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 315–320.
Sainath, T.N., Kingsbury, B., Saon, G., Soltau, H., Mohamed, A., Dahl, G.E., Ramabhadran, B., 2015. Deep convolutional neural networks for large-scale speech tasks. Neural Netw. 64, 39–48.
Sak, H., Senior, A., Beaufays, F., 2014a. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Proceedings of Interspeech. Singapore, pp. 338–342.
Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., Mao, M., 2014b. Sequence discriminative distributed training of long short-term memory recurrent neural networks. In: Proceedings of Interspeech. Singapore, pp. 1209–1213.
Saon, G., Soltau, H., Emami, A., Picheny, M., 2014. Unfolded recurrent neural networks for speech recognition. In: Proceedings of Interspeech. Singapore, pp. 343–347.
Seide, F., Li, G., Yu, D., 2011. Conversational speech transcription using context-dependent deep neural networks. In: Proceedings of Interspeech. Florence, Italy, pp. 437–440.
Seide, F., Li, G., Chen, X., Yu, D., 2011. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Waikoloa, USA, pp. 24–29.
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Toth, L., 2013. Convolutional deep rectifier neural nets for phone recognition. In: Proceedings of Interspeech. Lyon, France, pp. 1722–1726.
Toth, L., 2014. Convolutional deep maxout networks for phone recognition. In: Proceedings of Interspeech. Singapore, pp. 1078–1082.
Tsakalidis, S., Hsiao, R., Karakos, D., Ng, T., Ranjan, S., Saikumar, G., Zhang, L., Nguyen, L., Schwartz, R., Makhoul, J., 2014. The 2013 BBN Vietnamese telephone speech keyword spotting system. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 7829–7833.
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A., 2008. Extracting and composing robust features with denoising autoencoders. In: Proceedings of International Conference on Machine Learning. ICML. Helsinki, Finland, pp. 1096–1103.
Vinyals, O., Ravuri, S.V., Povey, D., 2012. Revisiting recurrent neural networks for robust ASR. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Kyoto, Japan, pp. 4085–4088.
Weng, C., Yu, D., Watanabe, S., Juang, B.-H., 2014. Recurrent deep neural networks for robust speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 5532–5536.
Xu, H., Povey, D., Mangu, L., Zhu, J., 2011. Minimum Bayes Risk decoding and system combination based on a recursion for edit distance. Comput. Speech Lang. 25 (4), 802–828.
Zaremba, W., Sutskever, I., Vinyals, O., 2014. Recurrent neural network regularization. arXiv:1409.2329.