1 Introduction
In statistical machine learning, a major issue is the selection of an appropriate
feature space where input instances have desired properties for solving a par-
ticular problem. For example, in the context of supervised learning for binary
classification, it is often required that the two classes be separable by a hyperplane.
When this property is not directly satisfied in the input space, one can map
instances into an intermediate feature space where the classes are linearly separable.
This intermediate space can either be specified explicitly by hand-coded features,
be defined implicitly with a so-called kernel function, or be learned automatically.
In the first two cases, it is the user's responsibility to design the feature space.
This can incur a huge cost in terms of computational time or expert knowledge,
especially with high-dimensional input spaces, such as when dealing with images.
As for the third alternative, automatically learning the features with deep
architectures, i.e. architectures composed of multiple layers of nonlinear processing,
is a relevant choice. Indeed, some highly nonlinear functions can be represented
much more compactly, in terms of number of parameters, with deep architectures
than with shallow ones (e.g. SVMs). For example, it has been proven that the parity
function on n-bit inputs can be computed by a feed-forward neural network with
O(log n) hidden layers and O(n) neurons, whereas a feed-forward neural network with
only one hidden layer needs an exponential number of neurons to perform the same
task [1]. Moreover, in the case of highly varying functions, learning algorithms
entirely based on local generalization are severely impacted by the curse of
dimensionality [2]. Deep architectures address this issue through the use of
distributed representations and as such may constitute a tractable alternative.
Figure 2: The RBM architecture, with a visible layer (v) and a hidden layer (h).
We first present RBMs as a probabilistic model before showing how the neural
network equations arise naturally.
An RBM defines a probability distribution p on data vectors v as follows:
p(v) = \frac{\sum_{h} e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}} .   (1)
The variable v is the input vector and the variable h corresponds to unob-
served features [7] that can be thought of as hidden causes not available in the
original dataset. An RBM defines a joint probability on both the observed and
unobserved variables which are referred to as visible and hidden units respec-
tively (see Fig. 2). The distribution is then marginalized over the hidden units
to give a distribution over the visible units only. The probability distribution
is defined by an energy function E (RBMs are a special case of energy-based
models [8]), which is usually defined over couples (v, h) of binary vectors by:
E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} w_{ij} v_i h_j ,   (2)
with a_i and b_j the biases associated with the input variables v_i and the hidden
variables h_j respectively, and w_{ij} the weight of the pairwise interaction between
them. In accordance with (1), configurations (v, h) with a low energy are given a
high probability, whereas a high energy corresponds to a low probability.
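To make equations (1) and (2) concrete, here is a minimal sketch in Python/NumPy (the layer sizes and parameter values are arbitrary illustrations, not taken from the text) that computes the energy of a configuration and the exact probability p(v) by brute-force enumeration, which is only tractable for very small models:

```python
import itertools
import numpy as np

# Arbitrary small RBM: 3 visible units, 2 hidden units (illustrative values).
a = np.array([0.1, -0.2, 0.0])        # visible biases a_i
b = np.array([0.3, -0.1])             # hidden biases b_j
W = np.array([[ 0.5, -0.3],
              [ 0.2,  0.8],
              [-0.6,  0.1]])          # weights w_ij

def energy(v, h):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij w_ij v_i h_j  (eq. 2)."""
    return -a @ v - b @ h - v @ W @ h

def p_v(v):
    """Exact p(v) from eq. (1): marginalize h in the numerator and
    enumerate all (u, g) configurations in the partition function."""
    configs = lambda n: itertools.product([0, 1], repeat=n)
    num = sum(np.exp(-energy(v, np.array(h))) for h in configs(2))
    Z = sum(np.exp(-energy(np.array(u), np.array(g)))
            for u in configs(3) for g in configs(2))
    return num / Z

v = np.array([1, 0, 1])
print(p_v(v))                                                 # a probability in (0, 1)
print(sum(p_v(np.array(u))
          for u in itertools.product([0, 1], repeat=3)))      # sums to ~1.0
```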
The energy function above is crafted to make the conditional probabilities
p(h|v) and p(v|h) tractable. The computation is done using the usual neural
network propagation rule (see Fig. 2) with:
p(v|h) = \prod_{i} p(v_i|h)  and  p(v_i = 1|h) = \mathrm{sigm}\Big(a_i + \sum_{j} h_j w_{ij}\Big),
p(h|v) = \prod_{j} p(h_j|v)  and  p(h_j = 1|v) = \mathrm{sigm}\Big(b_j + \sum_{i} v_i w_{ij}\Big),   (3)
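Equation (3) is exactly the usual propagation rule followed by a Bernoulli draw. A minimal sketch of one step of block Gibbs sampling, reusing the same illustrative (arbitrary) parameters, could look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = np.array([0.1, -0.2, 0.0]), np.array([0.3, -0.1])   # biases, as above
W = np.array([[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]])        # weights, as above

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # p(h_j = 1 | v) = sigm(b_j + sum_i v_i w_ij): the usual propagation rule.
    p = sigm(b + v @ W)
    return (rng.random(p.shape) < p).astype(int), p

def sample_v_given_h(h):
    # p(v_i = 1 | h) = sigm(a_i + sum_j h_j w_ij)
    p = sigm(a + W @ h)
    return (rng.random(p.shape) < p).astype(int), p

h, _ = sample_h_given_v(np.array([1, 0, 1]))   # hidden sample from a visible vector
v_recon, _ = sample_v_given_h(h)               # visible sample from the hidden one
```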
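For real-valued inputs, the binary RBM can be extended into a Gaussian variant by modifying the energy function. The parametrization assumed here is the standard Gaussian-Bernoulli form, chosen to be consistent with the conditionals in (4) below:

E(v, h) = \sum_{i} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} w_{ij} h_j ,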
where \sigma_i represents the standard deviation of the input variable v_i. Using this
energy function, the conditional probability p(h|v) is almost unchanged, but p(v|h)
becomes a multivariate Gaussian with mean components a_i + \sigma_i \sum_{j} w_{ij} h_j
and a diagonal covariance matrix:
p(v_i = x|h) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left(-\frac{\big(x - a_i - \sigma_i \sum_{j} w_{ij} h_j\big)^2}{2\sigma_i^2}\right),
p(h_j = 1|v) = \mathrm{sigm}\Big(b_j + \sum_{i} \frac{v_i}{\sigma_i} w_{ij}\Big).   (4)
An auto-associator is trained to minimize a reconstruction loss between an input x
and its reconstruction \hat{x}, typically the squared error or the cross-entropy:

L_{SE}(x, \hat{x}) = \sum_{i} (\hat{x}_i - x_i)^2 ,
L_{CE}(x, \hat{x}) = -\sum_{i} \big[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \big] .
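As a minimal illustration (Python/NumPy, with arbitrary vectors), both losses can be computed as follows; the cross-entropy form assumes the components of x and \hat{x} lie in (0, 1):

```python
import numpy as np

def l_se(x, x_hat):
    # Squared-error reconstruction loss.
    return np.sum((x_hat - x) ** 2)

def l_ce(x, x_hat, eps=1e-12):
    # Cross-entropy reconstruction loss; eps avoids log(0).
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x     = np.array([1.0, 0.0, 1.0, 1.0])
x_hat = np.array([0.9, 0.2, 0.8, 0.7])
print(l_se(x, x_hat), l_ce(x, x_hat))
```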
Figure 5: The training scheme of a DAA. Noisy components are marked with a
cross.
In a Denoising Auto-Associator (DAA) [18], the instance fed to the network is not x
but a corrupted version x̃. After training, if the network is able to compute a
reconstruction x̂ of x with a small loss, then it can be assumed that the network has
learned to remove the noise from the data, in addition to encoding it in a different
feature space (see Fig. 5).
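A minimal sketch of this denoising training criterion (NumPy only; the masking-noise corruption, tied weights and layer sizes are illustrative assumptions rather than the exact setup of [18]) is:

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda z: 1.0 / (1.0 + np.exp(-z))

def corrupt(x, noise_level=0.3):
    # Masking noise: each component is set to 0 with probability noise_level.
    mask = rng.random(x.shape) >= noise_level
    return x * mask

def forward(x_tilde, W, b_enc, b_dec):
    # Encoder / decoder with tied weights (an illustrative choice).
    h = sigm(x_tilde @ W + b_enc)        # code
    x_hat = sigm(h @ W.T + b_dec)        # reconstruction
    return x_hat

d, k = 8, 4
W, b_enc, b_dec = 0.1 * rng.standard_normal((d, k)), np.zeros(k), np.zeros(d)

x = rng.integers(0, 2, size=d).astype(float)   # clean input
x_tilde = corrupt(x)                           # corrupted input fed to the network
x_hat = forward(x_tilde, W, b_enc, b_dec)

# The denoising criterion: the loss compares x_hat with the *clean* x.
loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
```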
Finally, a Stacked Auto-Associator (SAA) [3, 19, 18, 20] is a deep neural network
trained following the deep learning scheme: an unsupervised greedy layer-wise
pre-training stage followed by a supervised fine-tuning stage, as explained in
Sect. 2.3 (see also Fig. 1). Surprisingly, for d-dimensional inputs and layers of
size k ≥ d, an SAA rarely learns the identity function [3]. In addition, it is
possible to use different regularization rules, and the most successful results have
been reported when adding a sparsity constraint on the activations of the encoding
units [20, 21, 22]. This leads to learning features in the intermediate layers that
are very different from those of RBMs, and the network performs a trade-off between
the reconstruction loss and the information content of the representation [21].
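The overall SAA training scheme can be summarized by the following sketch (PyTorch is used here purely for convenience; the layer sizes, noise level, number of epochs, optimizer settings and placeholder data are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sizes = [784, 256, 64]                  # input and two hidden layers (illustrative)
inputs = torch.rand(128, sizes[0])      # placeholder unlabeled data in [0, 1]
labels = torch.randint(0, 10, (128,))   # placeholder labels for fine-tuning

encoders = [nn.Sequential(nn.Linear(m, n), nn.Sigmoid())
            for m, n in zip(sizes, sizes[1:])]

def pretrain_layer(encoder, data, epochs=5):
    """Unsupervised pre-training of one layer as a denoising auto-associator."""
    decoder = nn.Linear(encoder[0].out_features, data.shape[1])
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(epochs):
        x_tilde = data * (torch.rand_like(data) > 0.3).float()   # masking noise
        x_hat = torch.sigmoid(decoder(encoder(x_tilde)))
        loss = nn.functional.binary_cross_entropy(x_hat, data)
        # A sparsity penalty on the codes could be added here, e.g.
        # loss = loss + 1e-3 * encoder(x_tilde).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()

# 1) Greedy layer-wise pre-training: each layer is trained on the codes of the previous one.
data = inputs
for enc in encoders:
    pretrain_layer(enc, data)
    data = enc(data).detach()

# 2) Supervised fine-tuning of the whole stack topped by an output layer.
net = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(5):
    loss = nn.functional.cross_entropy(net(inputs), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```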
In an MKM, each KPCA layer is followed by a feature selection step [23], which
proceeds as follows:
1. rank the n_l features according to their mutual information with the class
labels;
2. for increasing values of m_l ≤ n_l, train a classifier on the m_l top-ranked
features and estimate its error rate by cross validation;
3. the value of m_l with which the classifier reaches the lowest error rate
determines the number of features to retain (a schematic example is sketched below).
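A minimal sketch of this selection procedure (scikit-learn; the classifier, the candidate values of m_l and the placeholder data are illustrative assumptions, with Z standing for the n_l features produced by a KPCA layer):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Z: (n_samples, n_l) features from one KPCA layer, y: class labels.
# Random placeholders here; in practice these come from the previous layer.
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 50))
y = rng.integers(0, 2, 200)

# 1) rank features by mutual information with the labels
order = np.argsort(mutual_info_classif(Z, y))[::-1]

# 2) cross-validate a classifier on the m_l top-ranked features
errors = {}
for m_l in (5, 10, 20, 50):
    acc = cross_val_score(LinearSVC(), Z[:, order[:m_l]], y, cv=5).mean()
    errors[m_l] = 1.0 - acc

# 3) keep the number of features with the lowest error rate
best_m_l = min(errors, key=errors.get)
```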
However, the main drawback of using KPCA as the building block of an MKM lies in
the fact that the feature selection process must be done separately and thus
requires a time-consuming cross validation stage. To avoid this issue when training
an MKM, it is proposed in this special session [25] to use a more efficient kernel
method, Kernel Partial Least Squares (KPLS).
KPLS does not need cross validation to select the most informative features, but
embeds this process in the projection strategy [26]. The features are obtained
iteratively, in a supervised manner. At each iteration j, KPLS selects the j-th
feature, the one most correlated with the class labels, by solving an updated
eigenproblem. The eigenvalue λ_j of the extracted feature indicates its
discriminative importance. The number of features to extract, i.e. the number of
iterations performed by KPLS, is determined by a simple thresholding of λ_j.
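Schematically, and only as a rough sketch (the extraction step below is a simplified NIPALS-style power iteration, not the exact algorithm of [26]), the thresholding rule can be written as:

```python
import numpy as np

def kpls_features(K, Y, lambda_min, max_features=10, n_iter=200):
    """Sketch of iterative KPLS-style feature extraction with a stopping rule
    based on thresholding the eigenvalue lambda_j (simplified, not exact)."""
    n = K.shape[0]
    rng = np.random.default_rng(0)
    features = []
    for _ in range(max_features):
        # Extract the score t_j: dominant eigenvector of K Y Y^T (power iteration).
        t = rng.standard_normal(n)
        for _ in range(n_iter):
            t = K @ (Y @ (Y.T @ t))
            t /= np.linalg.norm(t)
        lambda_j = t @ (K @ (Y @ (Y.T @ t)))   # eigenvalue: discriminative importance
        if lambda_j < lambda_min:              # simple thresholding of lambda_j
            break
        features.append(t)
        # Deflation: remove the extracted direction from K and Y.
        P = np.eye(n) - np.outer(t, t)
        K, Y = P @ K @ P, P @ Y
    return np.column_stack(features) if features else np.empty((n, 0))

# Illustrative usage with a linear kernel on random placeholder data.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
Y = np.eye(2)[rng.integers(0, 2, 60)]          # one-hot class labels
T = kpls_features(X @ X.T, Y, lambda_min=1.0)  # extracted features, one per column
```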
4 Discussion
4.1 What are the application domains of deep learning?
Deep learning architectures express their full potential when dealing with highly
varying functions, which would require a very large number of labeled samples to be
captured by shallow architectures. In practice, unsupervised pre-training makes it
possible to achieve good generalization performance even when the training set is
of limited size, by positioning the network in a region of the parameter space where
supervised gradient descent is less likely to get stuck in a poor local minimum of
the loss function. Deep networks have been widely applied to visual tasks such as
handwritten digit recognition, object category recognition, pedestrian detection
[33] or off-road robot navigation [34], and also to acoustic signals for audio
classification [35]. In natural language processing, a very interesting approach
[36] shows that deep architectures can perform multi-task learning, giving
state-of-the-art results on difficult tasks such as semantic role labeling. Deep
architectures can also be applied to regression with Gaussian processes [37] and
to time series prediction [38]; in the latter, conditional RBMs have given
promising results.
Another interesting application area is highly nonlinear data compression. To
reduce the dimensionality of an input instance, it is sufficient that the number of
units in the last layer of a deep architecture is smaller than the input
dimensionality. In practice, limiting the size of a layer can force the network to
capture interesting nonlinear structure in the data. Moreover, adding layers to a
neural network can lead to more abstract features, from which input instances can
be encoded with high accuracy in a more compact form. Reducing the dimensionality
of data was presented as one of the first applications of deep learning [39]. This
approach is very efficient for semantic hashing of text documents [22, 40], where
the codes produced by the deepest layer are used to build a hash table from a set
of documents. Retrieved documents are those whose code differs by only a few bits
from the code of the query document. A similar approach for a large-scale image
database is presented in this special session [41].
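As an illustration of this retrieval scheme (NumPy only; the code length, radius and placeholder activations are arbitrary assumptions, with a linear Hamming-distance scan standing in for an actual hash table):

```python
import numpy as np

rng = np.random.default_rng(0)
deep_codes = rng.random((1000, 32))          # placeholder: deepest-layer activations in (0, 1)
bits = (deep_codes > 0.5).astype(np.uint8)   # binarize into 32-bit codes

def retrieve(query_bits, bits, radius=2):
    """Return indices of documents whose code differs from the query
    by at most `radius` bits (Hamming distance)."""
    hamming = np.count_nonzero(bits != query_bits, axis=1)
    return np.flatnonzero(hamming <= radius)

query = bits[0]
neighbours = retrieve(query, bits)           # includes document 0 itself
```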
4.3 Conclusion
The strength of deep architectures lies in stacking multiple layers of nonlinear
processing, which is well suited to capturing highly varying functions with a
compact set of parameters. The deep learning scheme, based on greedy layer-wise
unsupervised pre-training, positions deep networks in a region of the parameter
space where supervised fine-tuning is more likely to avoid poor local minima. Deep
learning methods achieve very good accuracy, often the best reported, on tasks
where a large set of data is available, even if only a small number of instances
are labeled. This approach raises many theoretical and practical questions, which
are investigated by a growing and very active research community, and it casts new
light on our understanding of neural networks and deep architectures.
Acknowledgements
This work was supported by the French ANR as part of the ASAP project under
grant ANR_09_EMER_001_04.
References
[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel
Machines. 2007.
[2] Y. Bengio, O. Delalleau, and N. Le Roux. The curse of highly variable functions for local
kernel machines. In NIPS, 2005.
[3] Y. Bengio, P. Lamblin, V. Popovici, and H. Larochelle. Greedy layer-wise training of deep
networks. In NIPS, 2007.
[4] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief
nets. Neur. Comput., 18:1527–1554, 2006.
[5] P. Smolensky. Information processing in dynamical systems: Foundations of harmony
theory. In Parallel Distributed Processing, pages 194–281. 1986.
[6] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann
machines. Cognitive Science, 9:147–169, 1985.
[7] Z. Ghahramani. Unsupervised learning. In Adv. Lect. Mach. Learn., pages 72–112. 2004.
[8] M. Ranzato, Y.-L. Boureau, S. Chopra, and Y. LeCun. A unified energy-based framework
for unsupervised learning. In AISTATS, 2007.
[9] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines.
In ICML, 2010.
[10] A. Krizhevsky. Convolutional deep belief networks on CIFAR-10. Technical report, Univ.
Toronto, 2010.
[11] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neur.
Comput., 14:1771–1800, 2002.
[12] Y. Bengio and O. Delalleau. Justifying and generalizing contrastive divergence. Neur.
Comput., 21:1601–1621, 2009.
[13] A. Fischer and C. Igel. Training RBMs depending on the signs of the CD approximation
of the log-likelihood derivatives. In ESANN, 2011.
[14] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2:1–127, 2009.
[15] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value
decomposition. Biological Cybernetics, 59:291–294, 1988.
[16] G. E. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185–234, 1989.
[17] N. Japkowicz, S. J. Hanson, and M. A. Gluck. Nonlinear autoassociation is not equivalent
to PCA. Neur. Comput., 12:531–545, 2000.
[18] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing
robust features with denoising autoencoders. In ICML, 2008.
[19] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation
of deep architectures on problems with many factors of variation. In ICML, 2007.
[20] M. A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse
representations with an energy-based model. In NIPS, 2006.
[21] M. Ranzato, Y-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief net-
works. In NIPS, 2008.
[22] P. Mirowski, M. Ranzato, and Y. LeCun. Dynamic auto-encoders for semantic indexing.
In NIPS WS8, 2010.
[23] Y. Cho and L. Saul. Kernel methods for deep learning. In NIPS, 2009.
[24] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neur. Comput., 10:1299–1319, 1998.
[25] F. Yger, M. Berar, G. Gasso, and A. Rakotomamonjy. A supervised strategy for deep
kernel machine. In ESANN, 2011.
[26] R. Rosipal, L. J. Trejo, and B. Matthews. Kernel PLS-SVC for linear and nonlinear
classification. In ICML, 2003.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[29] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage
architecture for object recognition? In ICCV, 2009.
[30] G. Desjardins and Y. Bengio. Empirical evaluation of convolutional RBMs for vision.
Technical report, Univ. Montréal, 2008.
[31] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[32] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features
through topographic filter maps. In CVPR, 2009.
[33] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun.
Learning convolutional feature hierarchies for visual recognition. In NIPS. 2010.
[34] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and
Y. LeCun. Learning long-range vision for autonomous off-road driving. J. Field Robot.,
26:120–144, 2009.
[35] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for audio
classification using convolutional deep belief networks. In NIPS, 2009.
[36] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep
neural networks with multitask learning. In ICML, 2008.
[37] R. Salakhutdinov and G. E. Hinton. Using deep belief nets to learn covariance kernels for
Gaussian processes. In NIPS, 2008.
[38] M. D. Zeiler, G. W. Taylor, N. F. Troje, and G. E. Hinton. Modeling pigeon behaviour
using a conditional restricted Boltzmann machine. In ESANN, 2009.
[39] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural
networks. Science, 313:504–507, 2006.
[40] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approximate Reasoning,
50:969–978, 2009.
[41] A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-based image
retrieval. In ESANN, 2011.
[42] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way restricted Boltzmann
machines for modeling natural images. In AISTATS, 2010.
[43] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. J. Mach. Learn. Res., 2010.
[44] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does
unsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625–660, 2010.
[45] A. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Ng. On random weights and
unsupervised feature learning. In NIPS WS8, 2010.
[46] L. Arnold, H. Paugam-Moisy, and M. Sebag. Unsupervised layer-wise model selection in
deep neural networks. In ECAI, 2010.