Multi-Column Deep Neural Networks for Image Classification
such as unsupervised pre-training [29, 24, 2, 10] or carefully prewired synapses [27, 31].

(3) The DNNs of this paper (Fig. 1a) have 2-dimensional layers of winner-take-all neurons with overlapping receptive fields whose weights are shared [19, 1, 32, 7]. Given some input pattern, a max-pooling operation determines winning neurons by partitioning each layer into square regions of local inhibition and selecting the most active neuron of each region. The winners of some layer represent a smaller, down-sampled layer with lower resolution, feeding the next layer in the hierarchy. The approach is inspired by Hubel and Wiesel's seminal work on the cat's primary visual cortex [37], which identified orientation-selective simple cells with overlapping local receptive fields and complex cells performing down-sampling-like operations [15].
Figure 1. (a) DNN architecture. (b) MCDNN architecture. The input image can be preprocessed by P0, ..., Pn-1 blocks. An arbitrary number of columns can be trained on inputs preprocessed in different ways. The final predictions are obtained by averaging the individual predictions of each DNN. (c) Training a DNN. The dataset is preprocessed before training; then, at the beginning of every epoch, the images are distorted (D block). See text for more explanations.

(4) Note that at some point down-sampling automatically leads to the first 1-dimensional layer. From then on, only trivial 1-dimensional winner-take-all regions are possible, that is, the top part of the hierarchy becomes a standard multi-layer perceptron (MLP) [36, 18, 28]. Receptive fields and winner-take-all regions of our DNN are often (near-)minimal, e.g., only 2x2 or 3x3 neurons. This results in (near-)maximal depth of the layers with non-trivial (2-dimensional) winner-take-all regions. In fact, insisting on minimal 2x2 fields automatically defines the entire deep architecture, apart from the number of different convolutional kernels per layer [19, 1, 32, 7] and the depth of the plain MLP on top.
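The winner-take-all max-pooling step described above can be made concrete with a few lines of NumPy; this is only an illustrative sketch (the function and variable names are ours, not the paper's code):

    import numpy as np

    def max_pool(feature_map, region=2):
        # Winner-take-all down-sampling: keep only the most active neuron
        # of each non-overlapping (region x region) block.
        h, w = feature_map.shape
        h_out, w_out = h // region, w // region
        cropped = feature_map[:h_out * region, :w_out * region]
        blocks = cropped.reshape(h_out, region, w_out, region)
        return blocks.max(axis=(1, 3))

    # Example: a 4x4 map pooled with 2x2 regions yields a 2x2 map.
    fm = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool(fm, region=2))   # [[ 5.  7.] [13. 15.]]

Each output neuron simply forwards the largest activation of its region, which is the down-sampling behaviour that paragraph (3) attributes to complex cells.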
(5) Only winner neurons are trained, that is, other neurons cannot forget what they learnt so far, although they may be affected by weight changes in more peripheral layers. The resulting decrease of synaptic changes per time interval corresponds to a biologically plausible reduction of energy consumption. Our training algorithm is fully online, i.e. weight updates occur after each gradient computation step.

(6) Inspired by the microcolumns of neurons in the cerebral cortex, we combine several DNN columns to form a Multi-column DNN (MCDNN). Given some input pattern, the predictions of all columns are averaged:

    y^i_{MCDNN} = \frac{1}{N} \sum_{j=1}^{N} y^i_{DNN_j}        (1)

where i corresponds to the ith class, j runs over all DNN columns, and N is the number of columns. Before training, the weights (synapses) of all columns are randomly initialized. Various columns can be trained on the same inputs, or on inputs preprocessed in different ways. The latter helps to reduce both the error rate and the number of columns required to reach a given accuracy. The MCDNN architecture and its training and testing procedures are illustrated in Figure 1.

3. Experiments

We evaluate our architecture on various commonly used object recognition benchmarks and improve the state-of-the-art on all of them. The DNN architecture used for the various experiments is described in the following way: 2x48x48-100C5-MP2-100C5-MP2-100C4-MP2-300N-100N-6N represents a net with 2 input images of size 48x48; a convolutional layer with 100 maps and 5x5 filters; a max-pooling layer over non-overlapping regions of size 2x2; a convolutional layer with 100 maps and 5x5 filters; a max-pooling layer over non-overlapping regions of size 2x2; a convolutional layer with 100 maps and 4x4 filters; a max-pooling layer over non-overlapping regions of size 2x2; a fully connected layer with 300 hidden units; a fully connected layer with 100 hidden units; and a fully connected output layer with 6 neurons (one per class). We use a scaled hyperbolic tangent activation function for convolutional and fully connected layers, a linear activation function for max-pooling layers, and a softmax activation function for the output layer. All DNN are trained using online gradient descent with an annealed learning rate. During training, images are continually translated, scaled and rotated (and, in the case of characters, also elastically distorted), whereas only the original images are used for validation. Training ends once the validation error is zero or when the learning rate reaches its predetermined minimum. Initial weights are drawn from a uniform random distribution in the range [−0.05, 0.05].
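As an illustration only (the paper does not publish code), such a notation string can be turned into a layer-by-layer specification with a few lines of Python; all names in this sketch are hypothetical:

    def parse_architecture(spec):
        # Example: "2x48x48-100C5-MP2-...-6N" -> list of (layer type, parameters).
        tokens = spec.split("-")
        channels, height, width = (int(v) for v in tokens[0].split("x"))
        layers = [("input", channels, height, width)]
        for tok in tokens[1:]:
            if tok.startswith("MP"):            # max pooling, e.g. MP2 -> 2x2 regions
                layers.append(("maxpool", int(tok[2:])))
            elif "C" in tok:                    # convolution, e.g. 100C5 -> 100 maps, 5x5 filters
                maps, size = tok.split("C")
                layers.append(("conv", int(maps), int(size)))
            elif tok.endswith("N"):             # fully connected, e.g. 300N -> 300 units
                layers.append(("fc", int(tok[:-1])))
        return layers

    print(parse_architecture("2x48x48-100C5-MP2-100C5-MP2-100C4-MP2-300N-100N-6N"))

The printed list mirrors the layer-by-layer description given in the text.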
3.1. MNIST
The original MNIST digits [19] are normalized such that the width or height of the bounding box equals 20 pixels. Aspect ratios of the various digits vary strongly, and we therefore create six additional datasets by normalizing the digit width to 10, 12, 14, 16, 18 and 20 pixels. This is like seeing the data from different angles. We train five DNN columns per normalization, resulting in a total of 35 columns for the entire MCDNN. All 1x29x29-20C4-MP2-40C5-MP3-150N-10N DNN are trained for around 800 epochs with an annealed learning rate (i.e. initialized with 0.001 and multiplied by a factor of 0.993/epoch until it reaches 0.00003). Training a DNN takes almost 14 hours, and after 500 training epochs little additional improvement is observed. During training, the images are distorted at the beginning of every epoch (cf. Fig. 1c).

[Figure: DNN architecture used for MNIST, with an L3 convolutional layer (40 maps of 9x9, 800 filters of size 5x5), an L4 max-pooling layer (40 maps of 3x3), an L5 fully connected layer (150 neurons) and an L6 output layer (10 neurons), here predicting class "2".]

Table 2. Results on the MNIST dataset.

    Method           Paper   Error rate [%]
    CNN              [32]    0.40
    CNN              [26]    0.39
    MLP              [5]     0.35
    CNN committee    [6]     0.27
    MCDNN            this    0.23

How are the MCDNN errors affected by the number of preprocessors? We train 5 DNN on all 7 datasets. An MCDNN 'y out-of-7' (y from 1 to 7) averages 5y nets trained on y datasets. Table 3 shows that more preprocessing results in a lower MCDNN error.

Table 3. Average test error rate [%] of MCDNN trained on y preprocessed datasets.

    y   # MCDNN   Average Error [%]
    1   7         0.33±0.07
    2   21        0.27±0.02
    3   35        0.27±0.02
    4   35        0.26±0.02
    5   21        0.25±0.01
    6   7         0.24±0.01
    7   1         0.23

We also train 5 DNN for each odd normalization, i.e. W11, W13, W15, W17 and W19. The resulting 60-net MCDNN performs similarly (0.24%) to the 35-net MCDNN, indicating that additional preprocessing does not further improve recognition. We conclude that MCDNN outperform DNN trained on the same data, and that different preprocessors further decrease the error rate.
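All of the MCDNN variants above combine their columns with the same averaging rule of Eq. (1). The following NumPy sketch, with illustrative names and random stand-ins for the per-column softmax outputs, shows that final step:

    import numpy as np

    def mcdnn_predict(column_outputs):
        # Average the class posteriors of all columns (Eq. 1) and
        # return the winning class together with the averaged posterior.
        averaged = np.mean(column_outputs, axis=0)     # shape: (num_classes,)
        return int(np.argmax(averaged)), averaged

    # Example: three columns, softmax outputs over 10 classes for one image.
    rng = np.random.default_rng(0)
    columns = rng.dirichlet(np.ones(10), size=3)       # each row sums to 1
    label, posterior = mcdnn_predict(columns)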
Table 1. Test error rate [%] of the 35 NNs trained on MNIST. Wxx: width of the character normalized to xx pixels.
Trial W10 W12 W14 W16 W18 W20 ORIGINAL
1 0.49 0.39 0.40 0.40 0.39 0.36 0.52
2 0.48 0.45 0.45 0.39 0.50 0.41 0.44
3 0.59 0.51 0.41 0.41 0.38 0.43 0.40
4 0.55 0.44 0.42 0.43 0.39 0.50 0.53
5 0.51 0.39 0.48 0.40 0.36 0.29 0.46
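For reference, the per-normalization mean and standard deviation over the five trials of Table 1 can be recomputed directly from the values listed there; a short NumPy check:

    import numpy as np

    # Error rates [%] of the five trials from Table 1, one column per normalization.
    errors = np.array([
        # W10   W12   W14   W16   W18   W20   ORIGINAL
        [0.49, 0.39, 0.40, 0.40, 0.39, 0.36, 0.52],
        [0.48, 0.45, 0.45, 0.39, 0.50, 0.41, 0.44],
        [0.59, 0.51, 0.41, 0.41, 0.38, 0.43, 0.40],
        [0.55, 0.44, 0.42, 0.43, 0.39, 0.50, 0.53],
        [0.51, 0.39, 0.48, 0.40, 0.36, 0.29, 0.46],
    ])
    print("mean:", errors.mean(axis=0).round(3))
    print("std: ", errors.std(axis=0, ddof=1).round(3))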
We repeat the experiment with different random initializations and compute the mean and standard deviation of the error, which is rather small for original images, showing that our DNN are robust. Our MCDNN obtains a very low error rate of 11.21%, greatly raising the bar for this benchmark.

The confusion matrix (Figure 5) shows that the MCDNN almost perfectly separates animals from artifacts, except for planes and birds, which seems natural, although humans easily distinguish almost all of the incorrectly classified images, even if many are cluttered or contain only parts of the objects/animals (see the false positive and false negative images in Figure 5). There are many confusions between different animals; the frog class collects most false positives from other animal classes, with very few false negatives. As expected, cats are hard to tell from dogs, collectively causing 15.25% of the errors. The MCDNN with 8 columns (four trained on original data and one trained for each preprocessing used also for traffic signs) reaches a low 11.21% error rate, far better than any other algorithm.

4. Conclusion

This is the first time human-competitive results are reported on widely used computer vision benchmarks. On many other image classification datasets our MCDNN improves the state-of-the-art by 30-80% (Tab. 7). We drastically improve recognition rates on MNIST, NIST SD 19, Chinese characters, traffic signs, CIFAR10 and NORB. Our method is fully supervised and does not use any additional unlabeled data.

References

[1] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume 2766 of Lecture Notes in Computer Science. Springer, 2003.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Neural Information Processing Systems, 2007.
[3] N. P. Bichot, A. F. Rossi, and R. Desimone. Parallel and serial neural mechanisms for visual search in macaque area V4. Science, 308:529–534, 2005.
[4] P. R. Cavalin, A. de Souza Britto Jr., F. Bortolozzi, R. Sabourin, and L. E. S. de Oliveira. An implicit segmentation-based method for recognition of handwritten strings of characters. In SAC, pages 836–840, 2006.
[5] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Convolutional neural network committees for handwritten character classification. In International Conference on Document Analysis and Recognition, pages 1250–1254, 2011.
[7] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In International Joint Conference on Artificial Intelligence, pages 1237–1242, 2011.
[8] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, 2011.
[9] E. M. Dos Santos, L. S. Oliveira, R. Sabourin, and P. Maupin. Overfitting in the selection of classifier ensembles: a comparative study between PSO and GA. In Conference on Genetic and Evolutionary Computation, pages 1423–1424. ACM, 2008.
[10] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
[11] K. Fukushima. Neocognitron: A self-organizing neural network for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.
[12] E. Granger, P. Henniges, and R. Sabourin. Supervised learning of fuzzy ARTMAP neural networks through particle swarm optimization. Pattern Recognition, 1:27–60, 2007.
[13] P. J. Grother. NIST special database 19: Handprinted forms and characters database. Technical report, National Institute of Standards and Technology (NIST), 1995.
[14] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[15] D. H. Hubel and T. Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology (London), 160:106–154, 1962.
[16] A. L. Koerich and P. R. Kalva. Unconstrained handwritten character recognition using metaclasses of characters. In Intl. Conf. on Image Processing, pages 542–545, 2005.
[17] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Computer Science Department, University of Toronto, 2009.
[18] Y. LeCun. Une procédure d'apprentissage pour réseau à seuil asymétrique (a learning scheme for asymmetric threshold networks). In Proceedings of Cognitiva 85, pages 599–604, Paris, France, 1985.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[20] Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004.
[21] Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Learning algorithms for classification: A comparison on handwritten digit recognition. In J. H. Oh, C. Kwon, and S. Cho, editors, Neural Networks: The Statistical Mechanics Perspective, pages 261–276. World Scientific, 1995.
[22] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang. Chinese Handwriting Recognition Contest. In Chinese Conference on Pattern Recognition, 2010.
[23] J. Milgram, M. Cheriet, and R. Sabourin. Estimating accurate multi-class probabilities with support vector machines. In Int. Joint Conf. on Neural Networks, pages 1906–1911, 2005.
[24] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. of Computer Vision and Pattern Recognition Conference, 2007.
[25] M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR'07). IEEE Press, 2007.
[26] M. A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (NIPS 2006), 2006.
[27] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2(11):1019–1025, 1999.
[28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
[29] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighborhood structure. In Proc. of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.
[30] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In International Conference on Artificial Neural Networks, 2010.
[31] T. Serre, L. Wolf, S. M. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.
[32] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, pages 958–963, 2003.
[33] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In International Joint Conference on Neural Networks, 2011.
[34] D. Strigl, K. Kofler, and S. Podlipnig. Performance and scalability of GPU-based convolutional neural networks. In Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pages 317–324, 2010.
[35] R. Uetz and S. Behnke. Large-scale object recognition with CUDA-accelerated hierarchical neural networks. In IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2009.
[36] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
[37] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. J. Physiol., 148:574–591, 1959.