Maxout Networks
Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio
a series of hidden layers h = {h(1), . . . , h(L)}. Dropout trains an ensemble of models consisting of the set of all models that contain a subset of the variables in both v and h. The same set of parameters θ is used to parameterize a family of distributions p(y | v; θ, µ), where µ ∈ M is a binary mask determining which variables to include in the model. On each presentation of a training example, we train a different sub-model by following the gradient of log p(y | v; θ, µ) for a different randomly sampled µ. For many parameterizations of p (such as most multilayer perceptrons) the instantiation of different sub-models p(y | v; θ, µ) can be obtained by elementwise multiplication of v and h with the mask µ. Dropout training is similar to bagging (Breiman, 1994), where many different models are trained on different subsets of the data. Dropout training differs from bagging in that each model is trained for only one step and all of the models share parameters. For this training procedure to behave as if it is training an ensemble rather than a single model, each update must have a large effect, so that it makes the sub-model [...]

A maxout hidden layer implements the function h_i(x) = max_{j∈[1,k]} z_ij, where z_ij = x^T W···ij + b_ij, and W ∈ R^{d×m×k} and b ∈ R^{m×k} are learned parameters. In a convolutional network, a maxout feature map can be constructed by taking the maximum across k affine feature maps (i.e., pool across channels, in addition to spatial locations). When training with dropout, we perform the elementwise multiplication with the dropout mask immediately prior to the multiplication by the weights in all cases; we do not drop inputs to the max operator. A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function. Maxout networks learn not just the relationship between hidden units, but also the activation function of each hidden unit. See Fig. 1 for a graphical depiction of how this works.

[Figure 1: three panels (Rectifier, Absolute value, Quadratic) plotting h_i(x) against x.]
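To make the layer definition above concrete, the following is a minimal NumPy sketch of a dense maxout layer h_i(x) = max_{j∈[1,k]} z_ij, applying the dropout mask to the layer input before the affine maps, as the text specifies. It is an illustrative sketch rather than the authors' implementation; the function name, shapes, and dropout probability are assumptions.

    import numpy as np

    def maxout_layer(x, W, b, drop_prob=0.5, rng=None, train=True):
        """Dense maxout layer: h_i(x) = max_j (x . W[:, i, j] + b[i, j]).

        x: (n, d) inputs, W: (d, m, k) weights, b: (m, k) biases.
        The dropout mask multiplies the input *before* the affine maps,
        so nothing is dropped inside the max, matching the description above.
        """
        if rng is None:
            rng = np.random.default_rng(0)
        if train:
            mask = rng.random(x.shape) > drop_prob
            x = x * mask                        # drop inputs to the layer
        else:
            x = x * (1.0 - drop_prob)           # "W/2"-style rescaling at test time
        z = np.einsum('nd,dmk->nmk', x, W) + b  # z[n, i, j] = x_n . W[:, i, j] + b[i, j]
        return z.max(axis=-1)                   # max over the k affine feature maps

    # Toy usage: 5 examples, d=10 inputs, m=4 maxout units, k=3 pieces each.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 10))
    W = rng.standard_normal((10, 4, 3)) * 0.1
    b = np.zeros((4, 3))
    h = maxout_layer(x, W, b, rng=rng)
    print(h.shape)   # (5, 4)

In the convolutional variant described above, the same max would instead be taken across k affine feature maps (cross-channel pooling), optionally combined with spatial max pooling.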
[Figure: Histogram of maxout responses; y-axis: # of occurrences.]

Table 1. Test set misclassification rates for the best methods on the permutation invariant MNIST dataset. Only [...]

Table 2. Test set misclassification rates for the best methods on the general MNIST dataset, excluding methods that augment the training data.

    Method                                             Test error
    2-layer CNN + 2-layer NN (Jarrett et al., 2009)    0.53%
    Stochastic pooling (Zeiler & Fergus, 2013)         0.47%
    Conv. maxout + dropout                             0.45%

Table 3. Test set misclassification rates for the best methods on the CIFAR-10 dataset.

    Method                                                       Test error
    Stochastic pooling (Zeiler & Fergus, 2013)                   15.13%
    CNN + Spearmint (Snoek et al., 2012)                         14.98%
    Conv. maxout + dropout                                       11.68%
    CNN + Spearmint + data augmentation (Snoek et al., 2012)      9.50%
    Conv. maxout + dropout + data augmentation                    9.38%
5.1. MNIST

The MNIST (LeCun et al., 1998) dataset consists of 28 × 28 pixel greyscale images of handwritten digits 0-9, with 60,000 training and 10,000 test examples. For the permutation invariant version of the MNIST task, only methods unaware of the 2D structure of the data are permitted. For this task, we trained a model consisting of two densely connected maxout layers followed by a softmax layer. We regularized the model with dropout and by imposing a constraint on the norm of each weight vector, as in (Srebro & Shraibman, 2005). Apart from the maxout units, this is the same architecture used by Hinton et al. (2012). We selected the hyperparameters by minimizing the error on a validation set consisting of the last 10,000 training examples. To make use of the full training set, we recorded the value of the log likelihood on the first 50,000 examples at the point of minimal validation error. We then continued training on the full 60,000 example training set until the validation set log likelihood matched this number. We obtained a test set error of 0.94%, which is the best result we are aware of that does not use unsupervised pretraining. We summarize the best published results on permutation invariant MNIST in Table 1.

We also considered the MNIST dataset without the permutation invariance restriction. In this case, we used three convolutional maxout hidden layers (with spatial max pooling on top of the maxout layers) followed by a densely connected softmax layer. We were able to rapidly explore hyperparameter space thanks to the extremely fast GPU convolution library developed by Krizhevsky et al. (2012). We obtained a test set error rate of 0.45%, which sets a new state of the art in this category. (It is possible to get better results on MNIST by augmenting the dataset with transformations of the standard set of images (Ciresan et al., 2010).) A summary of the best methods on the general MNIST dataset is provided in Table 2.

5.2. CIFAR-10

The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 32 × 32 color images drawn from 10 classes, split into 50,000 train and 10,000 test images. We preprocess the data using global contrast normalization and ZCA whitening.

We follow a similar procedure as with the MNIST dataset, with one change. On MNIST, we find the best number of training epochs in terms of validation set error, then record the training set log likelihood and continue training using the entire training set until the validation set log likelihood has reached this value. On CIFAR-10, continuing training in this fashion is infeasible because the final value of the learning rate is very small and the validation set error is very high. Training until the validation set likelihood matches the cross-validated value of the training likelihood would thus take prohibitively long. Instead, we retrain the model from scratch, and stop when the new likelihood matches the old one.

Our best model consists of three convolutional maxout layers, a fully connected maxout layer, and a fully connected softmax layer. Using this approach we obtain a test set error of 11.68%, which improves upon the state of the art by over two percentage points. (If we do not train on the validation set, we obtain a test set error of 13.2%, which also improves over the previous state of the art.) If we additionally augment the data with translations and horizontal reflections, we obtain the absolute state of the art on this task at 9.38% error. In this case, the likelihood during the retrain never reaches the likelihood from the validation run, so we retrain for the same number of epochs as the validation run. A summary of the best CIFAR-10 methods is provided in Table 3.
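For reference, the global contrast normalization and ZCA whitening used to preprocess CIFAR-10 above can be sketched as follows. This is a generic NumPy sketch; the scale and regularization constants are common defaults and are assumptions, not values taken from the paper.

    import numpy as np

    def global_contrast_normalize(X, scale=55.0, eps=1e-8):
        """Subtract each image's mean and rescale it to (roughly) constant norm.

        X: (n, d) matrix of flattened images.  scale=55 is a common choice
        for CIFAR-10 but is an assumption here.
        """
        X = X - X.mean(axis=1, keepdims=True)
        norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
        return scale * X / np.maximum(norms, eps)

    def zca_whiten(X, eps=1e-2):
        """ZCA whitening: decorrelate pixels while staying close to image space."""
        mean = X.mean(axis=0)
        Xc = X - mean
        cov = Xc.T @ Xc / Xc.shape[0]
        U, S, _ = np.linalg.svd(cov)             # cov is symmetric PSD
        W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
        return Xc @ W, mean, W                   # keep mean/W to apply to test data

    # Toy usage on random data standing in for small flattened color images.
    X = np.random.default_rng(0).random((100, 8 * 8 * 3))
    X = global_contrast_normalize(X)
    Xw, mean, W = zca_whiten(X)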
[Figure: CIFAR-10 validation and training error with and without dropout; curves: Dropout/Validation, No dropout/Validation, Dropout/Train, No dropout/Train; y-axis: Error.]

Table 5. Test set misclassification rates for the best methods on the SVHN dataset.

    Method                       Test error
    (Sermanet et al., 2012a)     4.90%
    [...]

[...] from cross-channel pooling. Rectifier units do best without cross-channel pooling but with the same number of filters, meaning that the size of the state and the number of parameters must be about k times higher for rectifiers to obtain generalization performance approaching that of maxout.

[Figure: Comparison of large rectifier networks to maxout; curves: Maxout; Rectifier, no channel pooling; Rectifier + channel pooling; Large rectifier, no channel pooling; y-axis: validation set error for best experiment.]

[...] with dropout. We sampled several subsets of each [...]

[Figure: Model averaging: MNIST classification; test error versus # samples for Sampling, maxout; Sampling, tanh; W/2, maxout; W/2, tanh.]

[Figure 8 plot: KL divergence between model averaging strategies versus # samples, for maxout and tanh units.]

Figure 8. The KL divergence between the distribution predicted using the dropout technique of dividing the weights by 2 and the distribution obtained by taking the geometric mean of the predictions of several sampled models decreases as the number of samples increases. This suggests that dropout does indeed do model averaging, even for deep networks. The approximation is more accurate for maxout units than for tanh units.
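The comparison summarized in Figure 8 can be illustrated in miniature: for a small, randomly initialized network with dropout on its hidden layer, compare the prediction obtained by the weight-halving (W/2) rule against the renormalized geometric mean of predictions over sampled dropout masks, and measure the KL divergence between the two. The tiny tanh network below is a hypothetical stand-in, not a model from the paper.

    import numpy as np

    def softmax(a):
        a = a - a.max(axis=-1, keepdims=True)
        e = np.exp(a)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    d, m, c = 20, 50, 10                        # input dim, hidden units, classes
    W1 = rng.standard_normal((d, m)) * 0.3
    W2 = rng.standard_normal((m, c)) * 0.3
    x = rng.standard_normal((1, d))
    h = np.tanh(x @ W1)                         # hidden activations (tanh stand-in)

    # Inference trick: divide the outgoing weights by 2 (dropout prob 0.5).
    p_w2 = softmax(h @ (W2 / 2.0))

    # Monte-Carlo alternative: geometric mean of predictions over sampled masks.
    n_samples = 1000
    log_p = np.zeros((1, c))
    for _ in range(n_samples):
        mask = rng.random(h.shape) > 0.5        # drop half the hidden units
        log_p += np.log(softmax((h * mask) @ W2))
    p_geo = softmax(log_p / n_samples)          # renormalized geometric mean

    kl = np.sum(p_geo * (np.log(p_geo) - np.log(p_w2)))
    print(f"KL(geometric mean || W/2): {kl:.5f}")

As the number of sampled masks grows, the Monte-Carlo geometric mean stabilizes; the remaining gap to the W/2 prediction is the quantity Figure 8 tracks.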
[...] linear units (Salinas & Abbott, 1996; Hahnloser, 1998; Glorot et al., 2011). The only difference between maxout and max pooling over a set of rectified linear units is that maxout does not include a 0 in the max. Superficially, this seems to be a small difference, but we find that including this constant 0 is very harmful to optimization in the context of dropout. For instance, on MNIST our best validation set error with an MLP is 1.04%. If we include a 0 in the max, this rises to over 1.2%. We argue that, when trained with dropout, maxout is easier to optimize than rectified linear units with cross-channel pooling.

8.1. Optimization experiments

To verify that maxout yields better optimization performance than max pooled rectified linear units when training with dropout, we carried out two experiments. First, we stressed the optimization capabilities of the training algorithm by training a small (two hidden convolutional layers with k = 2 and sixteen kernels) model on the large (600,000 example) SVHN dataset. When training with rectifier units the training error gets stuck at 7.3%. If we train instead with maxout units, we obtain 5.1% training error. As another optimization stress test, we tried training very deep and narrow models on MNIST, and found that maxout copes better with increasing depth than pooled rectifiers. See Fig. 9 for details.

[Figure 9: MNIST classification error versus network depth; curves: Maxout test error, Rectifier test error, Maxout train error, Rectifier train error; y-axis: Error.]

[Figure 10 plot: transition rates during training; curves: Maxout: neg→pos; Pooled rect: pos→zero; Pooled rect: zero→pos; x-axis: # epochs.]

Figure 10. During dropout training, rectifier units transition from positive to 0 activation more frequently than they make the opposite transition, resulting in a preponderance of 0 activations. Maxout units freely move between positive and negative signs at roughly equal rates.
Optimization proceeds very differently when using dropout than when using ordinary stochastic gradient descent. SGD usually works best with a small learning rate that results in a smoothly decreasing objective function, while dropout works best with a large learning rate, resulting in a constantly fluctuating objective function. Dropout rapidly explores many different directions and rejects the ones that worsen performance, while SGD moves slowly and steadily in the most promising direction. We find empirically that these different operating regimes result in different outcomes for rectifier units. When training with SGD, we find that the rectifier units saturate at 0 less than 5% of the time. When training with dropout, we initialize the units to saturate rarely, but training gradually increases their saturation rate to 60%. Because the 0 in the max(0, z) activation function is a constant, it blocks the gradient from flowing through the unit. In the absence of gradient through the unit, it is difficult for training to change this unit to become active again. Maxout does not suffer from this problem because gradient always flows through every maxout unit; even when a maxout unit is 0, this 0 is a function of the parameters and may be adjusted. Units that take on negative activations may be steered to become positive again later. Fig. 10 illustrates how active rectifier units become inactive at a greater rate than inactive units become active when training with dropout, but maxout units, which are always active, transition between positive and negative activations at about equal rates in each direction. We hypothesize that the high proportion of zeros and the difficulty of escaping them impairs the optimization performance of rectifiers relative to maxout.
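A tiny numerical illustration of the gradient-blocking argument above (a sketch with made-up numbers, not the paper's experiment): when every affine piece feeding a pooled rectifier is negative, the constant 0 wins the max and the gradient with respect to the unit's parameters is exactly zero, while the corresponding maxout unit still passes gradient through its winning piece.

    import numpy as np

    x = np.array([1.0, 2.0, 0.5])                   # one input vector
    W = np.array([[-1.0, -0.5],                      # d=3 inputs, k=2 pieces
                  [-0.5, -1.0],
                  [-2.0, -0.5]])
    b = np.array([-0.1, -0.2])
    z = x @ W + b                                    # both pieces are negative here

    # Pooled rectifier: h = max(0, z_1, ..., z_k).  The constant 0 wins,
    # so dh/dW is exactly zero and nothing pushes the unit back to life.
    h_rect = max(0.0, z.max())
    grad_rect = np.zeros_like(W) if h_rect == 0.0 else np.outer(x, z == z.max())

    # Maxout: h = max(z_1, ..., z_k).  The winning piece is still an affine
    # function of the parameters, so gradient continues to flow through it.
    j = int(np.argmax(z))
    h_max = z[j]
    grad_max = np.zeros_like(W)
    grad_max[:, j] = x                               # dh/dW[:, j] = x

    print(h_rect, np.abs(grad_rect).sum())           # 0.0, 0.0  -> blocked
    print(h_max, np.abs(grad_max).sum())             # negative output, nonzero gradient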
To test this hypothesis, we trained two MLPs on MNIST, both with two hidden layers and 1200 filters per layer pooled in groups of 5. When we include a constant 0 in the max pooling, the resulting trained model fails to make use of 17.6% of the filters in the second layer and 39.2% of the filters in the second layer. A small minority of the filters usually took on the maximal value in the pool, and the rest of the time the maximal value was a constant 0. Maxout, on the other hand, used all but 2 of the 2400 filters in the network. Each filter in each maxout unit in the network was maximal for some training example. All filters had been utilised and tuned.
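The filter-usage statistic reported above can be computed with a few lines: for each pooling group, record which filters ever win the max over a set of examples, treating the constant 0 as an extra competitor in the pooled-rectifier case. The pre-activations below are synthetic and chosen only to make the effect visible; the resulting percentages are not those of the paper's MLPs.

    import numpy as np

    def dead_filter_fraction(z, include_zero):
        """z: (n_examples, n_units, k) pre-activations of one pooled layer.

        Returns the fraction of the n_units*k filters that never win
        their max pool over the whole set of examples.
        """
        n, m, k = z.shape
        if include_zero:
            # Pooled rectifier: the constant 0 competes with the k filters.
            z = np.concatenate([z, np.zeros((n, m, 1))], axis=-1)
        winners = z.argmax(axis=-1)                  # (n, m) index of each pool's winner
        used = np.zeros((m, k), dtype=bool)
        for j in range(k):
            used[:, j] = (winners == j).any(axis=0)  # does filter j ever win for unit i?
        return 1.0 - used.mean()

    rng = np.random.default_rng(0)
    # 240 pooling groups of 5 filters each (1200 filters), with a negative mean
    # so that the constant 0 frequently wins the rectifier pool.
    z = rng.standard_normal((100, 240, 5)) - 2.0
    print("dead filters, pooled rectifier:", dead_filter_fraction(z, include_zero=True))
    print("dead filters, maxout:          ", dead_filter_fraction(z, include_zero=False))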
8.3. Lower layer gradients and bagging

To behave differently from SGD, dropout requires the gradient to change noticeably as the choice of which units to drop changes. If the gradient is approximately constant with respect to the dropout mask, then dropout simplifies to SGD training. We tested the hypothesis that rectifier networks suffer from diminished gradient flow to the lower layers of the network by monitoring the variance with respect to dropout masks for fixed data during training of two different MLPs on MNIST. The variance of the gradient on the output weights was 1.4 times larger for maxout on an average training step, while the variance on the gradient of the first layer weights was 3.4 times larger for maxout than for rectifiers. Combined with our previous result showing that maxout allows training deeper networks, this greater variance suggests that maxout better propagates varying information downward to the lower layers and helps dropout training to better resemble bagging for the lower-layer parameters. Rectifier networks, with more of their gradient lost to saturation, presumably cause dropout training to resemble regular SGD toward the bottom of the network.

9. Conclusion

We have proposed a new activation function called maxout that is particularly well suited for training with dropout, and for which we have proven a universal approximation theorem. We have shown empirical evidence that dropout attains a good approximation to model averaging in deep models. We have shown that maxout exploits this model averaging behavior because the approximation is more accurate for maxout units than for tanh units. We have demonstrated that optimization behaves very differently in the context of dropout than in the pure SGD case. By designing the maxout gradient to avoid pitfalls such as failing to use many of a model's filters, we are able to train deeper networks than is possible using rectifier units. We have also shown that maxout propagates variations in the gradient due to different choices of dropout masks to the lowest layers of a network, ensuring that every parameter in the model can enjoy the full benefit of dropout and more faithfully emulate bagging training. The state of the art performance of our approach on five different benchmark tasks motivates the design of further models that are explicitly intended to perform well when combined with inexpensive approximations to model averaging.

Acknowledgements

The authors would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012), in particular Frédéric Bastien and Pascal Lamblin for their assistance with infrastructure development and performance optimization. We would also like to thank Yann Dauphin for helpful discussions.
References

Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral presentation.

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123–140, 1994.

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22:1–14, 2010.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), April 2011.

Goodfellow, Ian J., Courville, Aaron, and Bengio, Yoshua. Joint training of deep Boltzmann machines for classification. In International Conference on Learning Representations: Workshops Track, 2013.

Hahnloser, Richard H. R. On the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11(4):691–697, 1998.

Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.

Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09), pp. 2146–2153. IEEE, 2009.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012), 2012.

LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Malinowski, Mateusz and Fritz, Mario. Learnable pooling regions for image classification. In International Conference on Learning Representations: Workshop Track, 2013.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.

Rifai, Salah, Dauphin, Yann, Vincent, Pascal, Bengio, Yoshua, and Muller, Xavier. The manifold tangent classifier. In NIPS'2011, 2011. Student paper award.

Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), volume 8, 2009.

Salinas, E. and Abbott, L. F. A model of multiplicative neural responses in parietal cortex. Proc Natl Acad Sci U S A, 93(21):11956–11961, October 1996.

Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann. Convolutional neural networks applied to house numbers digit classification. CoRR, abs/1204.3968, 2012a.

Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012b.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan Prescott. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

Srebro, Nathan and Shraibman, Adi. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pp. 545–560. Springer-Verlag, 2005.

Srivastava, Nitish. Improving neural networks with dropout. Master's thesis, U. Toronto, 2013.

Wang, Shuning. General constructive representations for continuous piecewise-linear functions. IEEE Trans. Circuits Systems, 51(9):1889–1896, 2004.

Yu, Dong and Deng, Li. Deep convex net: A scalable architecture for speech pattern classification. In INTERSPEECH, pp. 2285–2288, 2011.

Zeiler, Matthew D. and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations, 2013.