
Improving Deep Neural Network Acoustic Models using Generalized Maxout Networks

Xiaohui Zhang, Jan Trmal, Daniel Povey, Sanjeev Khudanpur


Center for Language and Speech Processing & Human Language Technology Center of Excellence
The Johns Hopkins University, Baltimore, MD 21218, USA
{xiaohui,khudanpur}@jhu.edu, {dpovey,jtrmal}@gmail.com

ABSTRACT
• Maxout networks have brought significant improvements to various speech recognition
and computer vision tasks.
• We introduce two new types of generalized maxout units, which we call p-norm and
soft-maxout, and find that the p-norm generalization of maxout performs consistently well
in our LVCSR tasks.
• A simple normalization technique was used to prevent instability during training.

Tuning the p-norm Network

Related Works
• A p-norm pooling strategy has been used for learning image features [Kavukcuoglu et
al., CVPR09] [Boureau et al., ICML10] [Sermanet et al., ICPR12].
• A recent work [Gulcehre et al., arXiv (Nov. 2013)] proposed a learned-norm pooling
strategy for deep feedforward and recurrent neural networks.

Figure 4. In terms of WER, p-norm generally works well with fewer parameters
than the tanh system, but also overtrains more easily.
Figure 1. Tuning the group size and power p.

• We found that p = 2 and a group size of 5 to 10 perform consistently well.

ASR and KWS (Keyword Search) Experiments

Non-linearity Types
• The traditional nonlinearities for neural networks were sigmoidal functions (tanh or sigmoid).

• The rectified linear unit (ReLU, which is simply max(x, 0)) has also become popular.

• Recently, [Goodfellow et al., ICML13] proposed the maxout non-linearity.


• It is dimension-reducing: each group of inputs produces one output, y = max_i x_i.
• Assuming all hidden layers have the same dimension, the weight matrix increases the
dimension again.
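To illustrate the dimension reduction, here is a minimal NumPy sketch of a maxout unit (an illustration only, not the authors' implementation; it assumes the input dimension is a multiple of the group size):

```python
import numpy as np

def maxout(x, group_size):
    """Maxout: reduce each group of `group_size` inputs to its maximum."""
    x = np.asarray(x, dtype=float)
    assert x.shape[-1] % group_size == 0
    groups = x.reshape(*x.shape[:-1], -1, group_size)
    return groups.max(axis=-1)

# A 10-dimensional hidden-layer output with group size 5 yields 2 outputs;
# the next layer's weight matrix then maps the 2 back up to 10.
h = np.arange(10.0)
print(maxout(h, 5))  # -> [4. 9.]
```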

Proposed Variants of Maxout

• Maxout (baseline): y = max_i x_i
• Soft-maxout: y = log Σ_i exp(x_i)
• p-norm: y = (Σ_i |x_i|^p)^(1/p)

Figure 2. Tuning the number of layers.

Figure 5. WER in four languages under the limited LP condition.
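The soft-maxout and p-norm units can be sketched in NumPy as follows (a minimal illustration, not the authors' implementation; the reshaping assumes the input dimension is a multiple of the group size):

```python
import numpy as np

def soft_maxout(x, group_size):
    """Soft-maxout: y = log(sum_i exp(x_i)) over each group (a smooth max)."""
    groups = np.asarray(x, dtype=float).reshape(-1, group_size)
    # log-sum-exp with max subtraction for numerical stability
    m = groups.max(axis=-1)
    return m + np.log(np.exp(groups - m[:, None]).sum(axis=-1))

def p_norm(x, group_size, p=2):
    """p-norm: y = (sum_i |x_i|^p)^(1/p) over each group."""
    groups = np.abs(np.asarray(x, dtype=float)).reshape(-1, group_size)
    return (groups ** p).sum(axis=-1) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0, 0.0, 0.0])
print(p_norm(x, group_size=5, p=2))  # -> [5.]
```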

Stabilizing Training

• Here we used a group size G = 10 and K = 290 groups in each case (this was tuned to
give about 3 million parameters in the 2-layer case).
• The optimal number of hidden layers seems to be lower for the 2-norm (at 3 layers) than
for the other nonlinearities (at around 5).
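As a rough sanity check of the parameter budget, the count can be sketched as below. The input dimension (360) and output dimension (4000) are hypothetical illustrative values, not taken from the poster; only G = 10 and K = 290 come from the text.

```python
def dnn_params(num_hidden, K=290, G=10, in_dim=360, out_dim=4000):
    """Rough parameter count for a p-norm DNN: each hidden layer is an affine
    transform into K*G units, which pooling reduces to K outputs."""
    layer_in = in_dim
    total = 0
    for _ in range(num_hidden):
        total += (layer_in + 1) * K * G  # weights plus bias into K*G units
        layer_in = K                     # p-norm pooling: K*G -> K outputs
    total += (layer_in + 1) * out_dim    # final affine into the softmax layer
    return total

print(dnn_params(2))  # about 3 million with these assumed dimensions
```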

Overtraining with the p-norm Network

• When using maxout and related nonlinearities, training sometimes failed after many
epochs.
• Very large activations started appearing, so that the posteriors output by the softmax
layer could be exactly zero.
• We solved this by introducing a renormalization layer after each maxout/p-norm layer.
• This layer divides the whole input vector by its root-mean-square value (renormalizing
so that rms = 1).
• It is a many-to-many function (e.g. 200 inputs to 200 outputs).
• The same layer is used at test time as well.
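The renormalization layer described above can be sketched in a few lines of NumPy (an illustration of the idea, not the authors' implementation; the small epsilon guarding against division by zero is an assumption):

```python
import numpy as np

def renormalize(x, eps=1e-20):
    """Divide the activation vector by its root-mean-square value so that
    rms = 1. Applied after each maxout/p-norm layer, in training and test."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return x / rms

y = renormalize(np.array([3.0, -3.0, 3.0, -3.0]))
print(y)                          # each entry rescaled to magnitude 1
print(np.sqrt(np.mean(y ** 2)))   # rms of the output is 1
```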

Figure 6. ATWV in four languages under the limited LP condition.

Figure 3. p-norm overtrains more easily than the tanh system, in terms of the
objective function (cross-entropy).

• The two bar charts here show the WER and ATWV performance, respectively, of the
DNN+SGMM Limited LP systems in all four OP1 languages.
• The WERs are on the respective 10-hour development sets.
• The ATWVs for Bengali and Assamese correspond to NIST IndusDB scoring
protocols as of October 2013, while the Zulu and Haitian Creole ATWVs are based on
JHU-generated keyword lists and reference alignments (RTTM).

The authors were supported by DARPA BOLT contract No. HR0011-12-C-0015 and IARPA BABEL contract No. W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government.