
Available online at www.sciencedirect.com

Speech Communication 77 (2016) 53–64


www.elsevier.com/locate/specom

Maxout neurons for deep convolutional and LSTM neural networks in speech recognition
Meng Cai∗, Jia Liu
Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Received 1 July 2015; received in revised form 7 November 2015; accepted 7 December 2015
Available online 17 December 2015

Abstract
Deep neural networks (DNNs) have achieved great success in acoustic modeling for speech recognition. However, DNNs with sigmoid neurons
may suffer from the vanishing gradient problem during training. Maxout neurons are promising alternatives to sigmoid neurons. The activation
of a maxout neuron is obtained by selecting the maximum value within a local region, which results in constant gradients during the training
process. In this paper, we combine the maxout neurons with two popular DNN structures for acoustic modeling, namely the convolutional neural
network (CNN) and the long short-term memory (LSTM) recurrent neural network (RNN). The optimal network structures and training strategies
for the models are explored. Experiments are conducted on the benchmark data sets released under the IARPA Babel Program. The proposed
models achieve 2.5–6.0% relative improvements over their corresponding CNN or LSTM RNN baselines across six language collections. The
state-of-the-art results on these data sets are achieved after system combination.
© 2015 Elsevier B.V. All rights reserved.

Keywords: Maxout neuron; Convolutional neural network; Long short-term memory; Acoustic modeling; Speech recognition.

∗ Corresponding author.
E-mail addresses: cai-m10@mails.tsinghua.edu.cn, caimeng06@foxmail.com (M. Cai), liuj@tsinghua.edu.cn (J. Liu).
http://dx.doi.org/10.1016/j.specom.2015.12.003

1. Introduction

Automatic speech recognition (ASR) has been developing very rapidly in recent years. The application of deep neural networks (DNNs) in acoustic modeling is a strong force driving this development. Traditionally, acoustic models have been dominated by Gaussian mixture model-hidden Markov models (GMM-HMMs). The initial success of DNN-based acoustic models is on the phone recognition task on the TIMIT corpus (Mohamed et al., 2012). DNNs then quickly scale up to large vocabulary continuous speech recognition (Dahl et al., 2012). Compared to conventional GMM-based systems, the error rates of DNN-based systems drop by as much as 33% relative on the well-known Switchboard phone-call transcription benchmark (Seide et al., 2011). These results are soon confirmed and extended by many research groups (Hinton et al., 2012). Few technologies in the history of ASR have brought such large and consistent improvements as the DNN acoustic models (Deng and Huang, 2004).

The most widely used DNN model is the feed-forward DNN with fully-connected weights and sigmoid neurons. Despite its simplicity, this model has two potential drawbacks. First, feed-forward fully-connected DNNs do not have structures that explicitly model the prior knowledge of speech signals. This prior knowledge includes the local properties within speech frames, the long-time dependencies among speech frames, etc. Second, the sigmoid neurons of the DNN may cause the vanishing gradient problem during stochastic gradient descent (SGD) training (Glorot and Bengio, 2010), because the gradient of the sigmoid function tends to be very small if its input value is not close to zero. These two issues have attracted the attention of researchers in both the speech community and the machine learning community.

Previous works on exploiting prior knowledge of speech signals for DNNs mainly focus on two aspects. The first aspect is discovering local properties within the speech frames. Speech signals exhibit some well-known local properties in the spectrum, such as the speech formants. The convolutional neural network (CNN) (LeCun et al., 1998) has proved successful at modeling the local structures of the input data.
There are three key features of CNNs. First, the lower layers of the CNN are split into a set of local receptive fields, so that different spatial regions of the input data can be processed differently. Second, the weights connecting the receptive fields in adjacent layers are often replicated for robust parameter estimation. Third, pooling layers are often used in conjunction with the convolutional layers to reduce the signal variations. The CNN has been shown to outperform DNNs in acoustic modeling, both in small vocabulary tasks (Abdel-Hamid et al., 2012; Toth, 2013) and large vocabulary tasks (Sainath et al., 2013a, b).

The second aspect of exploiting prior knowledge of speech signals is utilizing long-time dependencies among the speech frames. Feed-forward neural networks discover cross-frame information by concatenating input features in a context window. However, the length of the context window has to be fixed. Some works use recurrent neural networks (RNNs) (Vinyals et al., 2012; Weng et al., 2014; Saon et al., 2014) to capture longer context and report better performance than feed-forward DNNs, especially for noise-robust tasks. But the time dependencies learned by standard RNNs are still limited due to the vanishing and exploding gradient problem (Bengio et al., 1994). Recent innovations in acoustic modeling explore the use of long short-term memory (LSTM) RNNs (Gers et al., 2002). The recurrent cells in LSTM RNNs are linearly connected in order to solve the vanishing gradient problem. The input signals, recurrent signals and output signals in LSTM RNNs are controlled by gate signals to achieve precise timing. LSTM RNN-based systems achieved state-of-the-art results on the TIMIT phone recognition task in 2013 (Graves et al., 2013a). Then LSTM RNN-HMM hybrid models are applied to large vocabulary tasks (Graves et al., 2013b). After that, researchers at Google extend the LSTM RNN acoustic model to big data and achieve impressive results on voice search tasks (Sak et al., 2014a, 2014b). The good performances of LSTM RNNs on various tasks demonstrate their ability to learn long-time dependencies from data.

To solve the vanishing gradient problem of the sigmoid neurons, previous solutions mostly fall into two categories. The first solution is to use pre-training methods for the sigmoid neurons. Existing pre-training algorithms include deep belief network (DBN) pre-training (Hinton et al., 2006), denoising autoencoders (Vincent et al., 2008), discriminative pre-training (Seide et al., 2011), etc. These methods aim at finding a good initialization for DNNs, so that the final model converges better with SGD training. However, pre-training methods introduce more computational cost and they are not directly applicable to neural network architectures other than feed-forward DNNs, e.g. CNNs or LSTM RNNs. The other solution to the vanishing gradient problem is to use neuron types that have constant gradients. The rectified linear unit (ReLU) (Nair and Hinton, 2010; Glorot et al., 2011) is a kind of neuron that produces constant gradients. The activation function of the ReLU has the form y = max(x, 0), so its gradient is either 1 or 0, depending on the input value. Later, the maxout neuron is proposed for machine learning tasks (Goodfellow et al., 2013). This neuron type achieves its nonlinearity by selecting the maximum value within a group of linear pieces, so its gradient is 1 if the piece has the maximum value and 0 otherwise. The maxout network has obtained state-of-the-art results in many data classification tasks (Goodfellow et al., 2013). It has also shown good results in acoustic modeling for speech recognition (Cai et al., 2013; Miao et al., 2013).

In this paper, we aim at building DNN structures for acoustic modeling that make use of prior knowledge in the speech signals and at the same time solve the vanishing gradient problem. The solution we propose is to combine the maxout neurons with both the CNN and the LSTM RNN. The convolutional structure and the LSTM recurrent structure of the networks capture information within and among the speech frames, while the maxout neurons are able to solve the vanishing gradient problem. The resulting models are referred to as the convolutional maxout neural network (CMNN) and the recurrent maxout neural network (RMNN). This work greatly extends our previous works (Cai et al., 2013, 2014) in four aspects. First, the RMNN model is proposed to capture long-term dependencies in speech, which has not been adopted previously. Second, we try various alternative network structures and training strategies to explore the optimal configuration. Third, extensive experiments are carried out on the Babel benchmark data sets, including six languages. Fourth, the system combination results for the CMNNs and the RMNNs are presented, which achieve state-of-the-art performances on these data sets.

The remainder of this paper is organized as follows. In Section 2, a description of the baseline CNN and LSTM RNN is given. In Section 3, we introduce the CMNN and the RMNN acoustic models. In Section 4, we give our experimental details and results. Finally, the conclusions are presented in Section 5.

2. Baseline CNN and LSTM RNN

In this work, the CNN acoustic models and the LSTM RNN acoustic models are employed as the baseline systems.

2.1. CNN acoustic model

The CNN aims at discovering local structures in the input data. The speech signal exhibits local similarities in both the spectral and temporal dimensions, making the application of CNN acoustic models possible. An example of the CNN applied to speech signals is illustrated in Fig. 1. In the figure, the light and shaded regions along the frequency axis represent the speech formants. The CNN has a set of locally-connected weights in the lower convolutional layers to model different regions in the speech spectrogram.
Fig. 1. An illustration of the CNN applied to Mel filterbank features of the speech signals.

To apply the CNN acoustic model to speech signals, the spectral structures of the speech features have to be preserved. The Mel filterbanks or the power spectrum features (Sainath et al., 2013) are good choices for the CNN. The input features are first divided into N non-overlapping frequency bands {v_i | i ∈ 0, ..., N − 1}. A group of bands {v_{i+r} | r ∈ 0, ..., s − 1} produces the activations h_i through the weight connections and nonlinear transforms. If full weight-sharing is used, then h_i is computed as

h_i = \theta\left( \sum_{r=0}^{s-1} W_r^T v_{i+r} + b \right),    (1)

where {W_r | r ∈ 0, ..., s − 1} and b are the weights and biases replicated in the convolutional layer, and θ is the nonlinear transform. The parameter s is called the band width and r is the index of the bands. The weights and biases shift d bands every time, producing a group of activation bands {h_j | j ∈ 0, ..., M − 1}. The parameter d is called the band shift, which is typically 1. The meanings of the band width s and the band shift d are also illustrated in Fig. 1.

The max-pooling layer is often used after the convolutional layer to reduce the variance of frequency shift. The form of the max-pooling is

p_i^m = \max_{j \in i \cdot k, \ldots, (i+1) \cdot k - 1} h_j^m,    (2)

where p_i^m is the output of the max-pooling operation, i and j are the indexes of the bands, m is the index of the neurons in a band and k is the pooling size.

After several layers of convolution or max-pooling, the bands are concatenated and fully-connected weights are applied just like in conventional DNNs.
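To make Eqs. (1) and (2) concrete, the following NumPy sketch (ours, not part of the original paper) computes the shared-weight band activations and the max-pooling over groups of k adjacent bands; the array shapes and the choice of tanh for θ are illustrative assumptions.

```python
import numpy as np

def conv_band_activations(v, W, b, theta=np.tanh, d=1):
    """Eq. (1): shared-weight convolution over frequency bands.

    v : (N, feat_dim) array of N input bands
    W : (s, feat_dim, out_dim) array of s replicated band filters
    b : (out_dim,) bias vector
    Returns the activation bands h_i = theta(sum_r W_r^T v_{i+r} + b).
    """
    N, s = v.shape[0], W.shape[0]
    h = []
    for i in range(0, N - s + 1, d):                    # shift by d bands each time
        z = sum(v[i + r] @ W[r] for r in range(s)) + b  # shared weights across positions
        h.append(theta(z))
    return np.stack(h)                                  # shape (M, out_dim)

def max_pool_bands(h, k):
    """Eq. (2): max over each group of k adjacent bands, per neuron m."""
    M = h.shape[0] // k
    return np.stack([h[i * k:(i + 1) * k].max(axis=0) for i in range(M)])
```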
2.2. LSTM RNN acoustic model

RNNs have the advantage of modeling longer context than DNNs. The speech signals are non-stationary time series, so context information is important for speech recognition. Conventional RNNs have cyclic connections in the hidden layers to model temporal correlations. But modeling long-time dependencies using a conventional RNN is difficult due to the vanishing and exploding gradients in the SGD training (Bengio et al., 1994). The LSTM RNN is an elegant solution for modeling long-time dependencies.

The architecture of the LSTM RNN is depicted in Fig. 2. The major difference of the LSTM RNN from the conventional RNN is that the recurrent connections are linear in the LSTM RNN. This connection is controlled by the signal f_t, called the forget gate. Because the recurrent connections in the LSTM RNN do not go through nonlinear transforms, the gradients will be back-propagated smoothly through the recurrent connections during SGD. In this way the model can learn long-time dependencies.

Fig. 2. An illustration of the LSTM memory block. The red lines stand for the connections with weights. The dashed lines stand for recurrent connections. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

In addition to the forget gate, there are also the input gate i_t and the output gate o_t in the LSTM RNN to control the input and output signals. Let x_t and y_t be the input and output signals for an LSTM network at time t. Let c_t and m_t be the cell activation and the output activation, respectively. The forward pass from x_t to y_t follows the equations according to Gers et al. (2002), Graves et al. (2013a) and Sak et al. (2014a):

i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)    (3)
f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f)    (4)
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)    (5)
o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)    (6)
m_t = o_t \odot h(c_t)    (7)
y_t = \phi(W_{ym} m_t + b_y)    (8)

where W_{cx}, W_{ix}, W_{fx} and W_{ox} are the weights connected to the LSTM inputs, W_{cm}, W_{im}, W_{fm} and W_{om} are the weights connected to the LSTM activations, b_c, b_i, b_f and b_o are the biases, and W_{ic}, W_{fc} and W_{oc} are diagonal peephole connections from the cell to the gate signals. W_{ym} and b_y are the weights and biases in the final output layer. σ, g and h are nonlinear functions; the sigmoid function is typically used for σ, and the tanh function is typically used for g and h. The symbol ⊙ denotes the element-wise product of vectors, and φ is the softmax function in the output layer.
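The following NumPy sketch of one forward step follows Eqs. (3)–(8) directly. It is our own illustration rather than the authors' implementation; the parameter-dictionary layout is assumed, and the diagonal peephole weights are stored as vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, p):
    """One forward step of Eqs. (3)-(8); p is a dict of parameters."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_im"] @ m_prev + p["W_ic"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fm"] @ m_prev + p["W_fc"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_cm"] @ m_prev + p["b_c"])
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_om"] @ m_prev + p["W_oc"] * c_t + p["b_o"])
    m_t = o_t * np.tanh(c_t)                       # cell output, Eq. (7)
    y_t = softmax(p["W_ym"] @ m_t + p["b_y"])      # output layer, Eq. (8)
    return y_t, m_t, c_t
```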
The standard LSTM RNN has one hidden layer. Later, deep LSTM RNNs are proved to outperform the standard LSTM RNN for speech recognition (Graves et al., 2013a; Sak et al., 2014a). Moreover, linear recurrent projection layers are found to be useful for reducing the number of parameters at no loss of accuracy (Sak et al., 2014a). The resulting model is referred to as the deep LSTMP (long short-term memory projected) network (Sak et al., 2014a). The LSTMP architecture is employed in the baseline LSTM RNN in this paper.

3. CMNN and RMNN

In this section we extend the exploration of the CMNN (convolutional maxout neural network) in our previous work (Cai et al., 2014). We also propose the new RMNN (recurrent maxout neural network).

3.1. Maxout neurons

The maxout neurons are promising alternatives to the sigmoid neurons. Conventional sigmoid neurons, though most widely used in DNNs, may cause the vanishing gradient problem during SGD training. The maxout neurons effectively deal with the vanishing gradient problem by producing constant gradients during SGD training.

The structure of the maxout neuron is illustrated in Fig. 3. Each maxout neuron consists of several pieces of alternative activations. The output of a maxout neuron is the maximum value within its piece group, i.e.,

h_i^l = \max_{j \in 1, \ldots, k} z_{ij}^l,    (9)

where h_i^l is the i-th maxout neuron output in the l-th layer, k is the number of pieces for the maxout neurons, and z_{ij}^l is the j-th piece activation of the i-th neuron in the l-th layer. The piece activations z^l are connected directly to the neurons of the previous layer through the weight matrix W^l and the bias vector b^l:

z^l = (W^l)^T h^{l-1} + b^l.    (10)

Neither the computation of z^l nor the computation of h^l includes nonlinear transforms such as sigmoid or tanh. The nonlinearity of the maxout neuron is achieved by the maximum-value selection process in Eq. (9). This nonlinearity can also be viewed as a feature selection process.

Fig. 3. An illustration of the maxout neurons.

In the SGD training stage, the gradient for the maxout neuron is computed as

\frac{\partial h_i^l}{\partial z_{ij}^l} = \begin{cases} 1 & \text{if } z_{ij}^l \ge z_{is}^l, \ \forall s \in 1, \ldots, k \\ 0 & \text{otherwise} \end{cases}    (11)

This equation shows that the gradient of the maxout neuron is 1 for the piece with the maximum activation, and 0 otherwise. By producing constant gradients during SGD training, the vanishing gradient problem is naturally solved in maxout networks. Therefore the optimization of deep maxout neural networks is easier compared to conventional sigmoid neural networks.
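A minimal NumPy sketch of Eqs. (9)–(11) is given below. It is our illustration, not the authors' code; the layout of k consecutive columns per maxout unit is an assumption.

```python
import numpy as np

def maxout_forward(h_prev, W, b, k):
    """Eqs. (9)-(10): linear pieces followed by a max over each group of k.

    h_prev : (in_dim,) previous-layer activations
    W      : (in_dim, n_units * k) piece weights, b : (n_units * k,) biases
    Returns the layer output h and the piece activations z (kept for backprop).
    """
    z = W.T @ h_prev + b            # Eq. (10): no squashing nonlinearity
    z = z.reshape(-1, k)            # (n_units, k) pieces per neuron
    return z.max(axis=1), z         # Eq. (9): select the maximum piece

def maxout_grad_mask(z):
    """Eq. (11): gradient is 1 for the winning piece of each neuron, 0 otherwise."""
    mask = np.zeros_like(z)
    mask[np.arange(z.shape[0]), z.argmax(axis=1)] = 1.0
    return mask
```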
3.2. Dropout training

Because the maxout neurons have better optimization performance, they can effectively deal with the problem of underfitting. However, neural networks with maxout neurons are sometimes prone to overfitting. Dropout (Srivastava et al., 2014) is an effective regularization method to control the overfitting problem of neural networks. Dropout is also particularly effective for maxout networks because of better model averaging (Goodfellow et al., 2013).

The dropout regularization includes different strategies for the training phase and the testing phase. During the SGD training phase, the activation h^l is obtained by applying a binary mask m^l to the original activations:

h^l = m^l \odot \theta\big((W^l)^T h^{l-1} + b^l\big),    (12)

where θ stands for the neuron nonlinearity such as sigmoid or maxout, and ⊙ is the element-wise product of vectors. The elements of the binary mask m^l obey the distribution Bernoulli(1 − r). The hyperparameter r is the ratio of neurons to be omitted, which is often called the dropout rate. A lower dropout rate preserves more information, while a higher dropout rate performs more aggressive regularization.

During the dropout testing phase, the neuron activations are no longer omitted. Instead, the neuron activations are scaled down by the coefficient 1 − r to compensate for the dropout training. This strategy is a simple way to perform model averaging (Li et al., 2013) and to improve the model generalization ability. In this paper, we would like to explore the optimal dropout strategy for the CMNN and the RMNN.
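A sketch of Eq. (12) together with the matching test-time scaling (our illustration; the random-generator seeding is arbitrary):

```python
import numpy as np

def dropout_layer(a, r, training, rng=np.random.default_rng(0)):
    """Eq. (12) at training time and the 1 - r scaling rule at test time.

    a : pre-dropout activations theta(W^T h + b)
    r : dropout rate (fraction of neurons omitted)
    """
    if training:
        mask = rng.binomial(1, 1.0 - r, size=a.shape)   # elements ~ Bernoulli(1 - r)
        return mask * a                                  # randomly omit neurons
    return (1.0 - r) * a                                 # test: scale down by 1 - r
```

Note that many modern toolkits use the equivalent "inverted" formulation, which rescales at training time instead; the sketch above follows the paper's description of scaling by 1 − r at test time.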
3.3. CMNN

Compared with fully-connected DNNs, CNN acoustic models have the advantage of discovering local spectral characteristics within speech frames. However, the CNNs may still suffer from the vanishing gradient problem during SGD training. A natural way to deal with the vanishing gradient problem for the CNN is to apply the maxout neurons. The goal of the CMNN (convolutional maxout neural network) is to combine the advantages of both the CNN and the maxout neurons.

In the CMNN, the convolutional structure, the max-pooling operation and the higher fully-connected layers are preserved from CNNs, but all the sigmoid neurons are replaced by the maxout neurons. The lower convolutional layers learn local properties of speech signals, while the maxout neurons provide better optimization performance. Fig. 4 illustrates the difference between the maxout operation and the max-pooling operation. In the convolutional layers, the maxout operation selects the maximum value within a local spatial region in the same band. The max-pooling is applied after the convolutional operation and the maxout operation. The output of the max-pooling operation is the maximum value from different bands.
Fig. 4. An illustration of maxout and max-pooling applied to the convolutional layers of the CMNN.

The combination of CNNs and maxout neurons has been explored in our previous work (Cai et al., 2014) and in several other works (Toth, 2014; Miao and Metze, 2014). Our models differ from the models of Miao and Metze (2014) because the maxout neurons in Miao and Metze (2014) are only applied to the fully-connected layers of the models, not the convolutional layers. In Toth (2014), the models are applied to the phone recognition task on the TIMIT corpus, while in this work we apply the CMNN to large vocabulary speech recognition tasks. The CMNN in this work also extends our previous work in Cai et al. (2014) as we explore the optimal dropout strategy for the CMNN. We also compare the CMNN models with the model structures of Miao and Metze (2014) in the experiments.

3.4. RMNN

The LSTM RNN acoustic model is good at modeling long-time dependencies of the speech signals. When unfolded in time, the LSTM RNN has deep structures in both the time domain and the spatial domain. In the time domain, the LSTM RNN can be viewed as a very deep neural network (the number of layers is the same as the number of speech frames) with shared weights. The LSTM RNN deals with the vanishing gradient in the time domain by using linear recurrent connections with gate signals. Thus the history states of the LSTM can have a long-term impact. However, the LSTM RNN does not explicitly deal with the vanishing gradient problem in the spatial domain. In Sak et al. (2014a), it is discovered that the optimal performance is achieved by an LSTM RNN with 2 or 3 hidden layers. But the optimal number of hidden layers is larger for DNNs than for LSTM RNNs, typically 7 for the Switchboard task (Seide et al., 2011). The optimal numbers of hidden layers for the maxout networks are even larger (Goodfellow et al., 2013; Cai et al., 2013). Because the maxout neurons are good solutions to the vanishing gradient problems for DNNs, it is intuitively beneficial to increase the depth of the LSTM RNN acoustic models using the maxout neurons.

Our proposed RMNN (recurrent maxout neural network) is a hybrid deep model combining the LSTM RNN and the maxout neurons. Its structure is illustrated in Fig. 5. The LSTMP structure (Sak et al., 2014a) is used in the lower layers of the RMNN to model long-time dependencies of the input signals. The fully-connected layers with maxout neurons are used after several LSTMP recurrent layers. Finally the softmax layer produces the output.

Fig. 5. The structure of the RMNN.

Besides the increased network depth, another benefit of the RMNN is that dropout training can be applied to the maxout neurons. Traditionally, the dropout regularization is mostly used with feed-forward neural networks. The work in Zaremba et al. (2014) has discovered that dropout can be effective for RNNs when it is only applied to the non-recurrent connections. But applying the dropout regularization to the recurrent connections is inappropriate because the dropout regularization tends to damage the long-term memories learned by the RNNs. This discovery inspires us to apply the dropout regularization to the maxout neurons in the fully-connected part of the RMNN, as the maxout neurons perform better model averaging in conjunction with dropout training. This is the first time the maxout neurons and the dropout training strategy are used for LSTM RNN models in speech recognition tasks.
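As an illustration of this design, the sketch below (a simplification under our own naming, not the authors' code) shows only the non-recurrent top of the RMNN: the frame-level output of the last LSTMP layer is passed through fully-connected maxout layers with dropout, while the recurrent connections themselves are left untouched.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rmnn_output_stack(m_t, fc_layers, W_out, b_out, r=0.2, training=True,
                      rng=np.random.default_rng(0)):
    """Fully-connected maxout part of the RMNN for one frame (sketch).

    m_t is the output of the last LSTMP recurrent layer at this frame.
    Each entry of fc_layers is a dict with weights W of shape (in_dim, units*k),
    biases b and piece count k.  Dropout is applied only here, never to the
    recurrent connections, as described in the text.
    """
    h = m_t
    for layer in fc_layers:
        z = (layer["W"].T @ h + layer["b"]).reshape(-1, layer["k"])
        h = z.max(axis=1)                                   # maxout, Eq. (9)
        if training:                                        # dropout on non-recurrent part
            h = rng.binomial(1, 1.0 - r, size=h.shape) * h
        else:
            h = (1.0 - r) * h
    return softmax(W_out @ h + b_out)                       # state posteriors
```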
4. Experiments

In this section, we present the detailed experiments for the CMNN and the RMNN acoustic models.

4.1. Data corpus

This effort uses language collections released under the IARPA Babel Program. We experiment with 6 different languages: Cantonese, Pashto, Turkish, Tagalog, Vietnamese and Tamil. The FullLP training sets are used for all these languages. Most of the training audio is conversational telephone speech, while some of the training audio is scripted read-style speech. All the development audio is telephone speech. We only use the telephone speech for training, as adding the scripted training data is found to degrade the performance. The corpus size and characteristics of the languages are listed in Table 1.

Table 1
The corpus size and characteristics of the languages.
Language    Release ID              Training hours  Dev hours  Characteristics
Cantonese   IARPA-babel101b-v0.4c   140.7           10.0       With 6 tones.
Pashto      IARPA-babel104b-v0.4bY  77.3            10.1       Arabic language. Used in Afghanistan and Pakistan. With primary and secondary stress.
Turkish     IARPA-babel105b-v0.4    76.3            10.0       Arabic language. With primary stress.
Tagalog     IARPA-babel106-v0.2g    83.7            9.9        Used in Philippines.
Vietnamese  IARPA-babel107b-v0.7    87.1            10.0       With 6 tones.
Tamil       IARPA-babel204b-v1.1b   62.3            10.1       Used in India, Sri Lanka and Singapore.

4.2. Baseline setup

The pronunciation lexicons are processed based on the released lexicons and the language specific peculiarities (LSP) documents. Phonemes with tonal markers are used for Cantonese and Vietnamese. Phonemes with stress markers are used for Pashto and Turkish. Trigram language models (LM) are trained on the transcriptions of the FullLP training packs with modified Kneser–Ney smoothing.

The baseline acoustic models are built with the Kaldi toolkit (Povey et al., 2011). The training procedure is the same for all the six languages. The GMM-HMM acoustic models are first trained using 13-dimensional PLP features concatenated with 3-dimensional pitch features (Ghahremani et al., 2014). The features are extracted with a frame length of 25 ms and a frame shift of 10 ms. After mean normalization per conversation side, 9 consecutive feature frames are concatenated and the feature dimension is reduced to 40 by LDA plus a global semi-tied covariance transform. Next, the GMM-HMMs are trained by speaker adaptive training (SAT) based on feature-space maximum likelihood linear regression (fMLLR) transforms of the LDA features. The GMM-HMMs are further enhanced by discriminative training using the boosted maximum mutual information (BMMI) criterion. Approximately 5000 acoustic states and 75,000 Gaussian mixtures are used for the GMM-HMMs. Some baseline configurations, including the vocabulary size, the number of phonemes, the number of acoustic model states and the language model perplexity, are given in Table 2.

Table 2
The baseline configurations of the languages. #: number of. AM: acoustic model. LM: language model.
Language    Vocabulary size  # phonemes  # AM states  LM perplexity
Cantonese   25,416           217         4610         127.6
Pashto      22,484           127         4940         159.7
Turkish     48,914           84          4866         431.6
Tagalog     26,319           139         4894         157.3
Vietnamese  6,961            238         4647         134.8
Tamil       64,601           34          4630         883.6

The features for the DNN-HMM acoustic models are 40-dimensional Mel filterbank features concatenated with first- and second-order derivatives. The features are normalized to zero mean and unit variance per conversation side. A context window of 11 frames (5 on the left and 5 on the right) is used for the DNN input features, so the dimension of the DNN input layer is 1320. The alignments for the DNN training are generated by the GMM-HMM with SAT and BMMI discriminative training. The dimension of the DNN output layer is the same as the number of acoustic model states shown in Table 2. The baseline DNN-HMM acoustic models have feed-forward fully-connected structures with sigmoid neurons. We use 6 hidden layers with 1500 neurons per hidden layer. The network parameters are initialized using the scheme introduced in Glorot and Bengio (2010). DBN pre-training is then performed for 3 epochs. SGD training is used to optimize the network parameters based on the cross entropy criterion. We use a batch size of 128 and an initial learning rate of 0.000625 per frame. The network is trained for 20 epochs. The learning rate is halved at the end of an epoch if the frame accuracy on the development set decreases. The momentum value is set to 0.0 for the first epoch and 0.9 for the rest of the epochs. We use the Kaldi decoder with the beam set to 11.0 and the acoustic weight set to 0.083333 in all of our evaluation experiments. The performances of the models are measured by token error rates (TER), which is the character error rate (CER) for Cantonese, the syllable word error rate (SyllableWER) for Vietnamese and the word error rate (WER) for Pashto, Turkish, Tagalog and Tamil. The results of the baseline models are given in Table 3. These results are comparable with the results published in various other works (Cui et al., 2013; Karafiat et al., 2013; Tsakalidis et al., 2014; Chen et al., 2014).

Table 3
The token error rates (TER) (%) of the baseline models for all the languages. ML: maximum likelihood. BMMI: boosted maximum mutual information.
Model          Cantonese  Pashto  Turkish  Tagalog  Vietnamese  Tamil
GMM-HMM ML     51.6       59.4    58.0     59.7     60.6        72.9
GMM-HMM BMMI   47.7       56.0    53.9     54.2     57.0        70.2
DNN            44.8       51.2    47.6     49.8     53.1        66.7
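A minimal sketch of the epoch-level schedule described above (halve the learning rate whenever the development-set frame accuracy drops, momentum 0.0 for the first epoch and 0.9 afterwards); the training and evaluation callbacks are placeholders supplied by the caller, not part of the paper or of Kaldi:

```python
def run_schedule(train_one_epoch, dev_frame_accuracy, epochs=20, lr=0.000625):
    """Epoch-level learning-rate and momentum schedule (our reconstruction)."""
    prev_acc = None
    for epoch in range(epochs):
        momentum = 0.0 if epoch == 0 else 0.9      # 0.0 for the first epoch, then 0.9
        train_one_epoch(lr, momentum)              # caller-supplied SGD trainer
        acc = dev_frame_accuracy()                 # caller-supplied dev evaluation
        if prev_acc is not None and acc < prev_acc:
            lr *= 0.5                              # halve the rate when dev accuracy drops
        prev_acc = acc
    return lr
```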
4.3. Results of the CMNN

The results of the CMNN acoustic models are presented in this section. We first explore different network structures and training strategies for Vietnamese. Then we give the results using the optimal strategy for all the languages.

4.3.1. The optimal dropout strategy
In our previous work, it is discovered that the CMNN models trained in conjunction with the dropout strategy produce significant gains compared to the baseline CNN models. It is also shown that the dropout regularization is effective when it is only applied to the fully-connected hidden layers of the CMNN (Cai et al., 2014). In this work, we take a step further to explore the optimal dropout strategy for the CMNN. More specifically, dropout for all or part of the fully-connected hidden layers of the CMNN is tried and the results are compared.

We first train a standard CNN model as the baseline for the CMNN models. The CNN model contains two convolutional hidden layers with 768 filters, one max-pooling hidden layer with a pooling size of 3 between the convolutional layers, and 3 fully-connected hidden layers with 1500 nodes. The input features for the CNN are the same as those for the DNN, namely 40-dimensional Mel filterbanks with delta and delta-delta derivatives. A context window of 11 frames is applied, so the input feature is divided into 40 bands with 33 feature maps per band. The filters in the first convolutional layer have an input band width of 8 and an input band shift of 1. The filters in the second convolutional layer have an input band width of 4 and an input band shift of 1. We use an initial learning rate of 0.00078125 per frame and a batch size of 256. The CNN is trained for 20 epochs. The momentum value increases from 0.0 to 0.5 linearly with every epoch.

The CMNN acoustic model is configured to have approximately the same number of parameters as the CNN model. The first convolutional operation has an input band width of 8, an input band shift of 1 and 512 filters. Then a maxout operation with 2 pieces is applied to reduce the number of feature maps to 256 in each band. A max-pooling layer with a pooling size of 3 follows the maxout operation. Another convolutional operation is applied on top of the max-pooling layer, which has an input band width of 4, an input band shift of 1 and 512 filters. The second convolutional hidden layer is also followed by a maxout operation with 2 pieces. Finally, 5 fully-connected hidden layers with 1000 maxout neurons and 2 pieces per neuron are applied, followed by the softmax output layer. Hence there are 6 fully-connected hidden layers in the CMNN in total. The initial learning rate for the CMNN is set to 0.0003125 per frame. The momentum increases linearly from 0.0 to 0.5.

We try to apply dropout training to different parts of the fully-connected layers of the CMNN to explore the optimal dropout strategy. Heuristically, it may be considered beneficial to apply dropout to the upper layers of the CMNN, as the neurons in the upper layers are more discriminative. We try to apply dropout to the top 2, the top 4 and the top 6 hidden layers of the CMNN. A dropout rate of 0.2 is used. The CMNNs are trained for at most 25 epochs. The results are shown in Table 4. The results of dropout training for the CNN models are also shown in Table 4, for which we try to apply dropout to the top 2 and the top 4 hidden layers. However, the CNN models with dropout training require up to 40 epochs to converge, making their training time much longer than that of the CMNN models. Detailed comparisons of the training time are given in Section 4.5.

Table 4
TERs (%) of different dropout strategies for the CNN models and the CMNN models for Vietnamese.
Model  Dropout layers  TER
CNN    None            52.5
CNN    Top 2           52.4
CNN    Top 4           51.8
CMNN   None            53.8
CMNN   Top 2           52.0
CMNN   Top 4           50.6
CMNN   Top 6           49.9

The results in Table 4 show that for both the CNN and the CMNN, the optimal dropout strategy is to apply dropout to all the fully-connected hidden layers. The CNN model with the optimal dropout strategy brings a 1.3% relative improvement over the CNN without dropout, while the CMNN model with the optimal dropout strategy brings a 7.2% relative improvement over the CMNN without dropout. It is also shown that the TER of the best CMNN model with dropout training is 3.7% relatively lower than that of the best CNN, and 5.0% relatively lower than that of the baseline CNN without dropout training. These results demonstrate the effectiveness of dropout training for the CMNNs.

4.3.2. Maxout neurons for the convolutional layers
The CMNN models in the previous experiments use maxout neurons for both the convolutional and the fully-connected hidden layers. In Miao and Metze (2014), another method to combine the CNN and the maxout neurons is used, in which sigmoid neurons are used for the convolutional hidden layers and maxout neurons are used for the fully-connected hidden layers. In this experiment we compare the CMNN models with maxout neurons or sigmoid neurons for the convolutional layers.

The structure of the CMNN with sigmoid convolutional layers is configured as follows. The first convolutional operation has an input band width of 8, an input band shift of 1 and 768 filters. Then max-pooling with a pooling size of 3 is applied. The sigmoid nonlinearity is used for the output of the max-pooling layer. Another convolutional operation is applied after the sigmoid nonlinearity, which has an input band width of 4, an input band shift of 1 and 384 filters. The sigmoid nonlinearity is also applied to the output of the second convolutional operation. Then 5 fully-connected hidden layers with 1000 maxout neurons and 2 pieces per neuron are applied before the softmax output layer. The hyperparameters for SGD training are the same as for the previous CMNN, namely an initial learning rate of 0.0003125 per frame and a momentum value increasing from 0.0 to 0.5. We also use dropout training for all the 6 fully-connected hidden layers, as it is found to be optimal for the CMNN. Training lasts for 25 epochs.
The results for the CMNN with sigmoid or maxout convolutional layers are shown in Table 5. These results show a performance gap between the CMNN with sigmoid convolutional layers and the CMNN with maxout convolutional layers. The reason for the superior performance of the maxout neurons is that the maxout neurons are easier to optimize than the sigmoid neurons. The constant gradients of the maxout neurons during SGD training are effective for dealing with the vanishing gradient problem, especially for the lower layers of the neural networks.

Table 5
A comparison of the CMNN models with sigmoid or maxout neurons for the convolutional layers. The TERs (%) are for Vietnamese.
Model  Convolutional layer type  TER
CMNN   Sigmoid                   50.9
CMNN   Maxout                    49.9

4.3.3. Results for all the languages
After the explorations conducted for Vietnamese, we now give the results for all the 6 languages. The network structures and the training process are the same as those mentioned in the previous experiments. The CMNN models are trained with dropout for all the fully-connected hidden layers. The results are shown in Table 6. The results for the CNN models with or without dropout training are both presented.

Table 6
TERs (%) of the CNN models and the CMNN models for all the languages. "dropout fc" stands for dropout training for all the fully-connected hidden layers. "sigmoid conv" stands for sigmoid neurons for the convolutional hidden layers.
Model               Cantonese  Pashto  Turkish  Tagalog  Vietnamese  Tamil
CNN                 43.6       51.5    48.3     49.6     52.5        67.2
CNN, dropout fc     46.1       52.0    47.6     49.6     51.8        66.0
CMNN, sigmoid conv  43.4       50.6    46.7     48.8     50.9        65.1
CMNN                41.7       49.3    45.4     47.2     49.9        64.2

The results in Table 6 show that the CMNN models outperform the CNN models across a variety of different languages. Compared with the baseline CNN models, the relative improvements of the CMNN models are between 4.3% and 6.0%. The results also show that the CMNNs with maxout convolutional layers have better performances than the CMNNs with sigmoid convolutional layers, confirming the discoveries in the previous experiments.

4.4. Results of the RMNN

In this section, we present the results of the proposed RMNN acoustic models. Different model structures are first tried for Vietnamese, then the results for all the languages are given.

4.4.1. The optimal network structure
Since the RMNNs are deep in both the time domain and the spatial domain, there are different structures that may have an impact on the performance of the models. For example, the number of recurrent hidden layers may affect the ability of the models to learn temporal relationships, and the number of fully-connected hidden layers may affect the ability of the models to learn feature transforms. It is also interesting to compare the results of the RMNNs with the results of deep LSTMP RNNs with extra fully-connected sigmoid or ReLU neurons, which are alternative model structures that are deeper than the LSTMP RNNs.

To give a solid performance comparison, we first train a standard LSTMP RNN as in Sak et al. (2014a). The model has 3 LSTMP layers, 1024 cells per layer and a linear projection of 512 dimensions for each layer. To keep the feature type consistent with the previous models, we use a single frame of Mel filterbank features with 120 dimensions (40-dimensional filterbank with delta and double delta) as the input to the LSTMP RNN. The target label is delayed for 5 frames, i.e., the information of previous feature frames and 5 future feature frames is used to predict the current label. Training is performed using the truncated backpropagation through time (BPTT) algorithm. A fixed subsequence length of 20 frames is used in the BPTT training. A stream of 20 subsequences from different utterances is processed in parallel for efficient GPU computation. After the training of a subsequence from an utterance finishes, the initial state of the LSTM is set to the last state of the finished subsequence if the next subsequence begins from the same utterance, or is reset if a subsequence begins from a new utterance. We use an initial learning rate of 0.0008 per frame. The momentum increases linearly from 0.0 to 0.5 with every training epoch. The LSTM RNN is trained for a maximum of 20 epochs.
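The subsequence scheduling described above can be sketched as follows (our reconstruction; the round-robin assignment of utterances to streams and the dictionary layout are assumptions not stated in the paper):

```python
def make_bptt_streams(utterances, subseq_len=20, num_streams=20):
    """Split utterances into fixed-length chunks for truncated BPTT.

    Each of the num_streams parallel streams walks through utterances in
    chunks of subseq_len frames.  carry_state is True when the next chunk
    continues the same utterance, so the trainer keeps the LSTM state;
    otherwise the state is reset at a new utterance.
    """
    streams = [[] for _ in range(num_streams)]
    for n, utt in enumerate(utterances):
        frames = utt["frames"]                      # list/array of feature frames
        chunks = [frames[i:i + subseq_len]
                  for i in range(0, len(frames), subseq_len)]
        for j, chunk in enumerate(chunks):
            streams[n % num_streams].append({
                "frames": chunk,
                "carry_state": j > 0,               # continue state within an utterance
            })
    return streams
```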
We try different RMNN structures that vary in the number of recurrent hidden layers and the number of fully-connected maxout hidden layers. Two RMNNs are trained in our experiments. One has 1 LSTMP recurrent layer and 5 maxout layers. The other has 2 LSTMP recurrent layers and 4 maxout layers. The LSTMP layers used in the RMNNs have 1024 cells and a linear projection of 512 dimensions. The maxout layers have 1000 neurons with 2 pieces. Like the LSTM RNNs, the input to the RMNN is a single frame of Mel filterbank features, while the training labels are delayed for 5 frames. We also use the subsequence length of 20 and the streams of 20 subsequences for the BPTT training. The initial learning rate is set to 0.0003125 per frame and the momentum increases from 0.0 to 0.5. Besides, a dropout rate of 0.2 is used for the fully-connected layers. The RMNNs are trained for a maximum of 30 epochs.

Besides the RMNN, there are other ways to make the LSTMP RNNs deep in the spatial domain. We try to create deep RNN models by adding fully-connected sigmoid or ReLU neurons on top of the LSTMP layers of the RNN, as comparisons to the RMNN models. We refer to these models as DRNNs (deep recurrent neural networks). Three DRNNs are trained in the experiments. The first model has 1 LSTMP recurrent layer and 2 sigmoid layers.
periments. The first model has 1 LSTMP recurrent layer and 2
The second model has 2 LSTMP recurrent layers and 2 sigmoid layers. The LSTMP layers have 1024 cells and a linear projection of 768 dimensions. The sigmoid layers have 1500 neurons. The hyperparameters for the BPTT training of these two models are the same as those for the LSTMP RNN, except for an initial learning rate of 0.000625 per frame. The third model has 1 LSTMP recurrent layer and 5 ReLU layers. The LSTMP layer has 1024 cells and a linear projection of 512 dimensions. The ReLU layers have 1500 neurons. The hyperparameters for the BPTT training of this model are the same as those for the RMNN.

The results of the RNN acoustic models are presented in Table 7. The results show that the TER of the LSTM model is 47.8%, which is a very strong baseline. Both the RMNNs with 1 or 2 recurrent hidden layers achieve better results than the LSTM. The results also show that the RMNN with 2 recurrent hidden layers has the best performance, achieving a TER of 46.0%. This is a 3.8% relative improvement over the LSTM. However, the DRNNs with fully-connected sigmoid neurons do not achieve gains over the baseline. The performance of the DRNN with fully-connected ReLU neurons is on par with the baseline LSTMP RNN. The results indicate that it is not trivial to build deep RNNs. The better optimization performance of the maxout neurons and the better model averaging with the dropout training are beneficial for deep models.

Table 7
A comparison of RNNs with different structures. The TERs (%) are for Vietnamese.
Model  Structure                TER
LSTM   3 LSTMP                  47.8
RMNN   1 LSTMP + 5 maxout fc    46.9
RMNN   2 LSTMP + 4 maxout fc    46.0
DRNN   1 LSTMP + 2 sigmoid fc   50.5
DRNN   2 LSTMP + 2 sigmoid fc   50.0
DRNN   1 LSTMP + 5 ReLU fc      47.8

4.4.2. Results for all the languages
The RMNN models give good results for Vietnamese. It is more convincing to see how the RMNN models perform for other languages. We conduct experiments for all the 6 languages using the best network structure explored before, i.e., the RMNN models with 2 recurrent hidden layers and 4 maxout fully-connected layers. The standard LSTM models with 3 recurrent layers and the DRNN models with 2 recurrent layers and 2 sigmoid fully-connected layers are also trained. The hyperparameters for the BPTT are the same as those in the previous experiments except for Cantonese, for which we use a dropout rate r of 0.1 as there is more data. The results are given in Table 8.

Table 8
TERs (%) of the LSTM, the RMNN and the DRNN models for all the languages.
Model  Cantonese  Pashto  Turkish  Tagalog  Vietnamese  Tamil
LSTM   40.7       50.5    47.4     47.9     47.8        65.0
RMNN   39.0       48.1    44.9     45.7     46.0        63.4
DRNN   45.2       51.3    49.5     50.8     50.0        68.1

The results show that the RMNN models achieve good performances for different languages. The token error rates of the RMNN models are 2.5–5.3% relatively lower than those of the LSTM models. The DRNNs, however, only show degraded performances compared to the LSTM models. The experiments for all the languages demonstrate the effectiveness of the RMNN models. We also give the performance comparison of different acoustic models in Fig. 6. The figure clearly shows that the proposed RMNN acoustic models achieve the best performance compared to the other models for all the languages, outperforming the baseline DNNs by 4.9–13.4% relatively.

Fig. 6. A performance comparison of different acoustic models applied to all the languages.

4.5. Training time

Apart from the token error rates, another matter to take into consideration is the training time. We use highly efficient GPU implementations for all the models in our experiments. The training time per epoch and the number of required epochs for all the models are given in Tables 9 and 10, respectively. The experiments are carried out on NVIDIA Tesla K20m GPUs.

Table 9
Training time per epoch (hours) of different acoustic models for all the languages.
Model         Cantonese  Pashto  Turkish  Tagalog  Vietnamese  Tamil
DNN           2.35       1.32    1.12     1.43     1.22        1.04
CNN           7.07       4.61    3.83     4.98     4.22        3.70
CNN, dropout  7.13       4.68    3.90     5.04     4.37        3.79
CMNN          7.09       4.62    4.05     4.87     4.30        3.13
LSTM          5.33       3.17    3.07     3.44     3.22        2.93
RMNN          4.78       2.93    2.81     3.11     3.00        2.19

Table 10
Number of required training epochs of different acoustic models for all the languages.
Model         Cantonese  Pashto  Turkish  Tagalog  Vietnamese  Tamil
DNN           20         19      20       20       18          18
CNN           20         20      20       19       20          19
CNN, dropout  27         40      38       40       40          40
CMNN          24         25      24       25       24          25
LSTM          13         12      11       14       13          14
RMNN          22         27      29       30       28          23

From the training time presented in Tables 9 and 10, we can see that the baseline DNN models cost the least training time. The training time per epoch for the CNN models is between 3 and 4 times slower than the DNNs. This is because the local weight connections in the convolutional structures require many small matrix multiplications, which are less efficient for GPUs.
The training time per epoch for the CNNs with dropout training is on par with that of the CMNNs. However, the CNNs with dropout training need many more training epochs to converge than the CMNNs. The CNNs with dropout training require up to 40 training epochs, which is too computationally expensive. The training time per epoch for the LSTM models and the RMNN models is about 2 or 3 times slower than the DNNs, which is more computationally efficient than the CNNs. The parallel processing of the feature streams from different utterances enables efficient training of the LSTM recurrent structures.

Fig. 7 illustrates the frame accuracies of different models on the training set and the development set for Vietnamese. The figure shows that the CMNN model has better development set accuracy than the baseline DNN and CNN, because dropout training improves its generalization ability. The CNN with dropout training shows small and steady improvements with every training epoch, but it converges very slowly. The figure also shows that the accuracies of the LSTM and the RMNN are much better than those of the other models. This is because the recurrent weight connections are able to model longer frame context. The CMNN and the RMNN both converge quickly on the development set in spite of the dropout regularization, which shows the good optimization performance of the maxout neurons.

Fig. 7. The frame accuracy for different models on (a) the training set and (b) the development set for Vietnamese.

4.6. System combination

The previous experiments show the effectiveness of the CMNN models and the RMNN models. Since the CMNN and the RMNN make use of the prior knowledge of different aspects of speech signals (i.e., the local spectral properties within frames and the long-term dependencies among frames), they may have complementary information. In this experiment we try to combine the results of different systems.

We use the lattice-based system combination method proposed in Xu et al. (2011), which is implemented in the Kaldi toolkit. The results for all the languages are given in Table 11. We first try to combine the CNN models and the CMNN models, both of which have the convolutional structures. The relative improvements of the combined results are between −0.2% and 3.0% compared to the CMNN models. Then the LSTM models and the RMNN models are combined, both of which have the recurrent connections. The relative improvements of the combined results are between 1.0% and 4.5% compared to the RMNN models. These results suggest that the differences between the LSTM models and the RMNN models are greater than the differences between the CNN models and the CMNN models. This is indeed the case, since the RMNN requires the maxout neurons and extra fully-connected hidden layers compared to the LSTM, whereas the CMNN only requires the maxout neurons compared to the CNN.
Table 11
TERs (%) for system combination. The optimal results are shown in bold face.
Model                            Cantonese  Pashto  Turkish  Tagalog  Vietnamese  Tamil
DNN                              44.8       51.2    47.6     49.8     53.1        66.7
CNN                              43.6       51.5    48.3     49.6     52.5        67.2
CMNN                             41.7       49.3    45.4     47.2     49.9        64.2
LSTM                             40.7       50.5    47.4     47.9     47.8        65.0
RMNN                             39.0       48.1    44.9     45.7     46.0        63.4
CNN + CMNN                       41.8       48.4    45.1     45.9     48.4        63.8
LSTM + RMNN                      38.2       47.6    43.5     44.2     43.9        61.7
CMNN + RMNN                      38.4       46.1    42.3     42.9     44.1        60.9
DNN + CNN + LSTM                 40.3       48.1    44.1     44.7     46.8        62.6
DNN + CMNN + RMNN                39.3       46.5    42.7     43.4     45.5        61.2
CNN + CMNN + LSTM + RMNN         38.2       46.3    42.2     42.7     44.0        60.7
DNN + CNN + LSTM + CMNN + RMNN   38.7       46.6    42.4     43.0     44.9        60.9

The proposed CMNN models and RMNN models are also combined. The results are better than the previous combinations for Pashto, Turkish, Tagalog and Tamil, but a little worse than the LSTM + RMNN results for Cantonese and Vietnamese. We believe this is because both Cantonese and Vietnamese are tonal languages. The pitch in tonal languages is relatively stable, so the speech in tonal languages exhibits longer frame dependencies than the speech in non-tonal languages. The recurrent structures in the LSTMP RNNs and the RMNNs are suitable for modeling long frame dependencies.

Then we combine three baseline systems, i.e., the DNN, the CNN and the LSTM models. We also combine the DNNs with the CMNNs and RMNNs for comparison. The results show that the latter combinations achieve relative improvements of 2.2% to 3.3% over the former combinations. These results again confirm the effectiveness of the proposed models compared to the baselines. We further try to combine 4 systems and 5 systems. The best results are shown in bold face in Table 11. To the authors' knowledge, these are the state-of-the-art results on these data sets.

5. Conclusions

In this paper, we extend our previous work on the convolutional maxout neural network (CMNN) acoustic model and propose the new recurrent maxout neural network (RMNN) acoustic model. Both models take advantage of the maxout neurons, while they model the prior knowledge of the speech signals in different aspects. We compare alternative model structures and explore various dropout strategies for the CMNNs and the RMNNs. The performances of the models are evaluated for 6 different languages on the IARPA Babel data sets. System combinations are carried out using different models. Our main discoveries include: (1) Dropout training works well for both the CMNN and the RMNN. The optimal strategy is to apply dropout to all the fully-connected layers of the models. (2) Replacing the maxout neurons by sigmoid neurons in the convolutional part of the CMNNs or the fully-connected part of the RMNNs only results in degraded performance. (3) The relative TER reductions of the CMNNs over the CNNs are between 4.3% and 6.0%. The relative TER reductions of the RMNNs over the LSTM RNNs are between 2.5% and 5.3%. (4) The training time per epoch for the CMNN is about 3–4 times slower than the DNN of the same size. The training time per epoch for the RMNN is about 2–3 times slower than the DNN of the same size. (5) System combinations using the CNN, the LSTM, the CMNN and the RMNN models give state-of-the-art results for all the 6 languages. In the future we want to do multilingual training (Knill et al., 2013). We would also like to try more sophisticated CNN structures (Abdel-Hamid et al., 2014; Sainath et al., 2015) and to design new models that combine the advantages of both the CMNN models and the RMNN models.

Acknowledgments

The authors would like to thank the anonymous reviewers for improving the paper. This work is supported by the National Natural Science Foundation of China under Grant Nos. 61273268, 61370034, 61403224 and 61005017.

References

Abdel-Hamid, O., Mohamed, A., Jiang, H., Penn, G., 2012. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Kyoto, Japan, pp. 4277–4280.
Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., Yu, D., 2014. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22 (10), 1533–1545.
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5 (2), 157–166.
Cai, M., Shi, Y., Liu, J., 2013. Deep maxout neural networks for speech recognition. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 291–296.
Cai, M., Shi, Y., Kang, J., Liu, J., Su, T., 2014. Convolutional maxout neural networks for low-resource speech recognition. In: Proceedings of International Symposium on Chinese Spoken Language Processing. ISCSLP. Singapore, pp. 133–137.
Chen, N.F., Sivadas, S., Lim, B.P., Ngo, H.G., Xu, H., Pham, V.T., Ma, B., Li, H., 2014. Strategies for Vietnamese keyword search. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 4121–4125.
Cui, J., Cui, X., Ramabhadran, B., Kim, J., Kingsbury, B., Mamou, J., Mangu, L., Picheny, M., Sainath, T.N., Sethy, A., 2013. Developing speech recognition systems for corpus indexing under the IARPA Babel program. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 6753–6757.
Dahl, G.E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20 (1), 30–42.
Deng, L., Huang, X., 2004. Challenges in adopting speech recognition. Commun. ACM 47 (1), 69–75.
Gers, F.A., Schraudolph, N.N., Schmidhuber, J., 2002. Learning precise timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 115–143.
Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., Khudanpur, S., 2014. A pitch extraction algorithm tuned for automatic speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 2494–2498.
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256.
Glorot, X., Bordes, A., Bengio, Y., 2011. Deep sparse rectifier neural networks. J. Mach. Learn. Res. 15, 315–323.
Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y., 2013. Maxout networks. In: Proceedings of International Conference on Machine Learning. ICML. Atlanta, USA, pp. 2356–2364.
Graves, A., Mohamed, A., Hinton, G.E., 2013a. Speech recognition with deep recurrent neural networks. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 6645–6649.
Graves, A., Jaitly, N., Mohamed, A., 2013b. Hybrid speech recognition with deep bidirectional LSTM. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 273–278.
Hinton, G.E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18 (7), 1527–1554.
Hinton, G.E., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B., 2012. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29 (6), 82–97.
Karafiat, M., Grezl, F., Hannemann, M., Vesely, K., Cernocky, J.H., 2013. BUT Babel system for spontaneous Cantonese. In: Proceedings of Interspeech. Lyon, France, pp. 2589–2593.
Knill, K.M., Gales, M.J.F., Rath, S.P., Woodland, P.C., Zhang, C., Zhang, S.-X., 2013. Investigation of multilingual deep neural networks for spoken term detection. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 138–143.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
Li, J., Wang, X., Xu, B., 2013. Understanding the dropout strategy and analyzing its effectiveness on LVCSR. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 7614–7618.
Miao, Y., Metze, F., Rawat, S., 2013. Deep maxout networks for low-resource speech recognition. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 398–403.
Miao, Y., Metze, F., 2014. Improving language-universal feature extraction with deep maxout and convolutional neural networks. In: Proceedings of Interspeech. Singapore, pp. 800–804.
Mohamed, A., Dahl, G.E., Hinton, G.E., 2012. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20 (1), 14–22.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of International Conference on Machine Learning. ICML. Haifa, Israel, pp. 807–814.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K., 2011. The Kaldi speech recognition toolkit. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Waikoloa, USA.
Sainath, T.N., Mohamed, A., Kingsbury, B., Ramabhadran, B., 2013a. Deep convolutional neural networks for LVCSR. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Vancouver, Canada, pp. 8614–8618.
Sainath, T.N., Kingsbury, B., Mohamed, A., Dahl, G.E., Saon, G., Soltau, H., Beran, T., Aravkin, A., Ramabhadran, B., 2013b. Improvements to deep convolutional neural networks for LVCSR. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 315–320.
Sainath, T.N., Kingsbury, B., Mohamed, A., Ramabhadran, B., 2013. Learning filter banks within a deep neural network framework. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Olomouc, Czech Republic, pp. 297–302.
Sainath, T.N., Kingsbury, B., Saon, G., Soltau, H., Mohamed, A., Dahl, G.E., Ramabhadran, B., 2015. Deep convolutional neural networks for large-scale speech tasks. Neural Netw. 64, 39–48.
Sak, H., Senior, A., Beaufays, F., 2014a. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Proceedings of Interspeech. Singapore, pp. 338–342.
Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., Mao, M., 2014b. Sequence discriminative distributed training of long short-term memory recurrent neural networks. In: Proceedings of Interspeech. Singapore, pp. 1209–1213.
Saon, G., Soltau, H., Emami, A., Picheny, M., 2014. Unfolded recurrent neural networks for speech recognition. In: Proceedings of Interspeech. Singapore, pp. 343–347.
Seide, F., Li, G., Yu, D., 2011. Conversational speech transcription using context-dependent deep neural networks. In: Proceedings of Interspeech. Florence, Italy, pp. 437–440.
Seide, F., Li, G., Chen, X., Yu, D., 2011. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: Proceedings of Automatic Speech Recognition and Understanding. ASRU. Waikoloa, USA, pp. 24–29.
Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R., 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958.
Toth, L., 2013. Convolutional deep rectifier neural nets for phone recognition. In: Proceedings of Interspeech. Lyon, France, pp. 1722–1726.
Toth, L., 2014. Convolutional deep maxout networks for phone recognition. In: Proceedings of Interspeech. Singapore, pp. 1078–1082.
Tsakalidis, S., Hsiao, R., Karakos, D., Ng, T., Ranjan, S., Saikumar, G., Zhang, L., Nguyen, L., Schwartz, R., Makhoul, J., 2014. The 2013 BBN Vietnamese telephone speech keyword spotting system. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 7829–7833.
Vincent, P., Larochelle, H., Bengio, Y., Manzegol, P.-A., 2008. Extracting and composing robust features with denoising autoencoders. In: Proceedings of International Conference on Machine Learning. ICML. Helsinki, Finland, pp. 1096–1103.
Vinyals, O., Ravuri, S.V., Povey, D., 2012. Revisiting recurrent neural networks for robust ASR. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Kyoto, Japan, pp. 4085–4088.
Weng, C., Yu, D., Watanabe, S., Juang, B.-H., 2014. Recurrent deep neural networks for robust speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing. ICASSP. Florence, Italy, pp. 5532–5536.
Xu, H., Povey, D., Mangu, L., Zhu, J., 2011. Minimum Bayes Risk decoding and system combination based on a recursion for edit distance. Comput. Speech Lang. 25 (4), 802–828.
Zaremba, W., Sutskever, I., Vinyals, O., 2014. Recurrent neural network regularization. arXiv:1409.2329.
