Professional Documents
Culture Documents
competitive performance in visual recognition tasks via stochastic gradient descent. Although computing
on CIFAR-10 (Krizhevsky, 2009) and STL-10 (Coates the exact gradient is intractable, we can approximate
et al., 2011) datasets. In addition, our method also it using contrastive divergence (Hinton, 2002).
achieves state-of-the-art performance on phone classi-
fication tasks with the TIMIT dataset, which demon- 3. Learning Transformation-Invariant
strates wide applicability of our proposed algorithms Feature Representations
to other domains. 3.1. Transformation-invariant RBM
The rest of the paper is organized as follows. We pro- In this section, we formulate a novel feature learning
vide the preliminaries in Section 2, and in Section 3, we framework that can learn invariance to a set of lin-
introduce our proposed transformation-invariant fea- ear transformations based on the RBM. We begin the
ture learning algorithms. In Section 4, we review the section with describing the transformation operator.
previous work on invariant feature learning. Then, in
The transformation operator is defined as a mapping
Section 5, we report the experimental results on sev-
T : RD1 → RD2 that maps D1 -dimensional input vec-
eral datasets. Section 6 concludes the paper.
tors into D2 -dimensional output vectors (D1 ≥ D2 ).
2. Preliminaries In our case, we assume a linear transformation matrix
T ∈ RD2 ×D1 , i.e., each coordinate of the output vec-
In this paper, we present a general framework for
tor is represented as a linear combination of the input
learning locally-invariant features using transforma-
coordinates.
tions. For presentation, we will use the restricted
Boltzmann machine (RBM) as the main example.1 We With this notation, we formulate the transformation-
first describe the RBM below, followed by its novel ex- invariant restricted Boltzmann machine (TIRBM)
tension (Section 3). that can learn invariance to a set of transformations.
Specifically, for a given set of transformation matrices
The restricted Boltzmann machine is a bipartite undi-
Ts (s = 1, · · · , S), the energy function of TIRBM is
rected graphical model that is composed of visible and
defined as follows:
hidden layers. Assuming binary-valued visible and
hidden units, the energy function and the joint prob- E(v, H) = (5)
ability distribution are given as follows:2 K X
X S K X
X S
− (Ts v)T wj hj,s − bj,s hj,s − cT v
E(v, h) = −vT Wh − bT h − cT v, (1) j=1 s=1 j=1 s=1
1 S
P (v, h) = exp (−E(v, h)) (2) X
Z s.t. hj,s ≤ 1, hj,s ∈ {0, 1}, j = 1, · · · , K, (6)
D K s=1
where v ∈ {0, 1} are binary visible units, h ∈ {0, 1}
are binary hidden units, and W ∈ RD×K , b ∈ RK , where v are D1 -dimensional visible units, and wj are
and c ∈ RD are weights, hidden biases, and visible bi- D2 -dimensional (filter) weights corresponding to the
ases, respectively. Z is a normalization factor that de- j-th hidden unit. The hidden units are represented as
pends on the parameters {W, b, c}. Since RBMs have a matrix H ∈ {0, 1}K×S with hj,s as its (j, s)-th entry.
PS
no intra-layer connectivity, exact inference is tractable In addition, we denote zj = s=1 hj,s , zj ∈ {0, 1} as a
and block Gibbs sampling can be done efficiently using pooled hidden unit over the transformations.
the following conditional probabilities: In Equation (6), we impose a softmax constraint on
X hidden units so that at most one unit is activated at
p(hj = 1|v) = sigmoid( vi Wij + bj ) (3)
each row of H. This probabilistic max pooling3 al-
i
X lows us to obtain a feature representation invariant
p(vi = 1|h) = sigmoid( hj Wij + ci ) (4) to linear transformations. More precisely, suppose
j that the input v1 matches the filter wj . Given an-
other input v2 that is a transformed version of v1 , the
where sigmoid(x) = 1+exp1 (−x) . We train the RBM
TIRBM will try to find a transformation matrix Tsj so
parameters by minimizing the negative log-likelihood that the v2 matches the transformed filter TsTj wj , i.e.,
1
We will describe extensions to other feature learning v1T wj ≈ v2T TsTj wj .4 Therefore, v1 and v2 will both
algorithms in Section 3.4.
2 3
Due to space constraints, we present only the case A similar technique is used in convolutional deep belief
of binary-valued input variables; however, the RBM with networks (Lee et al., 2011), in which spatial probabilistic
real-valued input variables can be formulated straightfor- max pooling is applied over a small spatial region.
4
wardly (Hinton & Salakhutdinov, 2006; Lee et al., 2008). Note that the transpose TsT of a transformation ma-
Learning Invariant Representations with Local Transformations
v1 v2
Input
Probabilistic
(hi,si=1) (hj,sj =1) max pooling
zi (v1)=1 zj (v1)=1 zi (v2)=1 zj (v2)=1 Pooled features
Figure 1. Feature encoding of TIRBM. Here, v1 and v2 denote two image patches, and the shaded pattern inside v2
reflects v1 . The shaded patterns in transformed filters show the corresponding original filters wi or wj . The filters
selected via probabilistic max pooling across the set of transformations are shown in red arrows (e.g., in the rightmost
example, the hidden unit hj,sj corresponding to the transformation Tsj and the filter wj contributes to activate zj (v2 ).)
In this illustration, we assumed D1 = D2 and also the existence of the identity transformation.
activate zj after probabilistic max pooling. Figure 1 The expectation of pooled activation is written as
illustrates this idea. P T
s exp (wj Ts v + bj,s )
Compared to the regular RBM, the TIRBM can learn E[zj |v] = . (10)
1 + s exp (wjT Ts v + bj,s )
P
more diverse patterns, while keeping the number of pa-
rameters small. Specifically, multiplying transforma- Note that we regularize over the pooled hidden units
tion matrix (e.g., TsT wj ) can be viewed as increasing zj rather than individual hidden units hj,s . In our
the number of filters by the factor of S, but without experiments, we used L2 distance for D(·, ·), but one
significantly increasing the number of parameters due can also use KL divergence for the sparsity penalty.
to parameter sharing. In addition, by pooling over 3.3. Generating transformation matrices
local transformations, the filters can learn invariant
In this section, we discuss how to design the trans-
representations (i.e., zj ’s) to these transformations.
formation matrix T . For the ease of presentation, we
The conditional probabilities are computed as follows: assume 1-d transformations, but it can be extended
to 2-d cases (e.g., image transformations) straightfor-
exp (wjT Ts v + bj,s ) wardly. Further, we assume the case of D1 = D2 = D
p(hj,s = 1|v) = (7)
1 + s0 exp (wjT Ts0 v + bj,s0 )
P
here; but, we will discuss general cases later.
X
p(vi = 1|h) = sigmoid( (TsT wj )i hj,s + ci ) (8) As mentioned previously, T ∈ RD×D is a linear trans-
j,s formation matrix from x ∈ RD to y ∈ RD ; i.e., each
coordinate of y is constructed via linear combination
Similar to RBM training, we use stochastic gradient
of the coordinates in x with weight matrix T as follows:
descent to train TIRBM. The gradient of the log-
likelihood is approximated via contrastive divergence D
X
by taking the gradient of the energy function (Equa- yi = Tij xj , ∀i = 1, · · · , D. (11)
tion (5)) with respect to the model parameters. j=1
First, it can be readily adapted to autoencoders by which is composed of multiple layers of convolutional
defining the following softmax encoding and sigmoid restricted Boltzmann machines and probabilistic max
decoding functions: pooling, can learn a representation invariant to local
translation.
exp (wjT Ts v + bj,s )
fj,s (v) = (12) Besides translation, however, learning invariant fea-
1 + s0 exp (wjT Ts0 v + bj,s0 )
P
X tures for other types of image transformations have
v̂i = sigmoid( (TsT wj )i fj,s + ci ) (13) not been extensively studied. In contemporary work
j,s of ours, Kivinen and Williams (2011) proposed the
transformation equivariant Boltzmann machine, which
Following the idea of TIRBM, we can also formulate shares a similar mathematical formulation to our mod-
the transformation-invariant sparse coding as follows: els in that both try to infer the best matching fil-
K X
S ters by transforming them using linear transformation
(n) matrices. However, while their model was motivated
X X
min k TsT wj hj,s − v(n) k2 , (14)
W,H
n j=1 s=1 from the “global equivariance”, the main purpose of
our work is to learn locally-invariant features that can
s.t. kHk0 ≤ γ, kH(j, :)k0 ≤ 1, kwj k2 ≤ 1, (15)
be useful in classification tasks. Thus, rather than
where γ is a constant. The second constraint in (15) considering an algebraic group of transformation ma-
can be understood as an analogy to the softmax con- trices (e.g., “full rotations”), we focus on the variety
straint in Equation (6) of TIRBMs. of local transformations that include rotation, trans-
lation as well as scale variations. Furthermore, we
Similar to standard sparse coding, we can optimize effectively address the boundary effects that can be
the parameters by alternately optimizing W and H highly problematic in scaling and translation operators
while fixing the other. Specifically, H can be (approx- by forming a non-square T matrix, rather than zero-
imately) solved using Orthogonal Matching Pursuit, padding.5 In addition, we present a general framework
and therefore we refer this algorithm a transformation- of transformation-invariant feature learning and show
invariant Orthogonal Matching Pursuit (TIOMP). extensions based on the autoencoder and sparse cod-
ing. Overall, our argument is strongly supported by
4. Related Work the state-of-the-art performance in image and audio
Researchers have made significant efforts to de- classification tasks.
velop invariant feature representations. For exam-
ple, the rotation- or scale-invariant descriptors, such As another related work, feature learning methods
as SIFT (Lowe, 1999), have shown a great success in with topographic maps can also learn invariant repre-
many computer vision applications. However, these sentations (Hyvärinen et al., 2001; Kavukcuoglu et al.,
image descriptors usually demand a domain-specific 2009). Compared to these methods, our model is more
knowledge with a significant amount of hand-crafting. compact and has fewer parameters to train since it fac-
tors out the (filter) weights and their transformations.
As an alternative approach, several unsupervised In addition, given the same number of parameters, our
learning algorithms have been proposed to learn robust model can represent more diverse and variable input
feature representations automatically from the sensory patterns than topographic filter maps.
data. As an example, the denoising autoencoder (Vin-
cent et al., 2008) can learn robust features by trying 5. Experiments
to reconstruct the original data from the hidden repre- We begin by describing the notation. For images, we
sentations of randomly perturbed data generated from assume a receptive field size of r × r pixels (for input
the distortion processes, such as adding noise or mul- image patches) and a filter size of w × w pixels. We
tiplying zeros for randomly selected coordinates. define gs to denote the number of pixels corresponding
Among types of transformations relating to temporally to the transformation (e.g., translation or scaling). For
or spatially correlated data, translation has been ex- example, we translate the w × w filter across the r × r
tensively studied in the context of unsupervised learn- receptive field with a stride of gs pixels (Figure 2(a)),
ing. Specifically, convolutional training (LeCun et al., or scale down from (r−l·gs)×(r−l·gs) to w×w (where
1989; Kavukcuoglu et al., 2010; Zeiler et al., 2010; 0 ≤ l ≤ b(r − w)/gsc) by sharing the same center for
Lee et al., 2011; Sohn et al., 2011) is one of the most 5
We observed that learning with zero-padded squared
popular methods that encourages shift-invariance dur- transformation matrices showed significant boundary ef-
ing the feature learning. For example, the convolu- fect in its visualization of filters, and this often resulted in
tional deep belief network (CDBN) (Lee et al., 2011), significantly worse classification performance.
Learning Invariant Representations with Local Transformations
pixel patch that was densely extracted with a stride of Table 4. Phone classification accuracy on the TIMIT core
1 pixel, and averaged the patch-level activations over test set using linear SVMs.
each of the 4 quadrants in the image. Eventually, this Methods (Linear SVM) Accuracy
procedure yielded 4K-dimensional feature vectors for MFCC 68.2%
each image, which were fed into an L2-regularized lin- sparse RBM 76.3%
ear SVM. We performed 5-fold cross validation to de- sparse TIRBM 77.6%
termine the hyperparameter C. sparse coding (Ngiam et al., 2011) 76.8%
sparse filtering (Ngiam et al., 2011) 75.7%
For comparison to the baseline model, we separately
evaluated the sparse TIRBMs with a single type of
Table 5. Phone classification accuracy on the TIMIT core
transformation (translation, rotation, or scaling) us- test set using RBF-kernel SVMs.
ing K = 1, 600. As shown in Table 2, each single
Methods (RBF-kernel SVM) Accuracy
type of transformation in TIRBMs brought a signifi-
MFCC (baseline) 80.0%
cant performance gain over the baseline sparse RBMs.
MFCC + TIRBM (K= 256) 81.0%
The classification performance was further improved MFCC + TIRBM (K= 512) 81.5%
by combining different types of transformations into a MFCC + SC (Ngiam et al., 2011) 80.1%
single model. MFCC + SF (Ngiam et al., 2011) 80.5%
MFCC + CDBN (Lee et al., 2009) 80.3%
In addition, we also report the classification results
H-LMGMM (Chang & Glass, 2007) 81.3%
obtained using TIOMP-1 (see Section 3.4) for unsu-
pervised training. In this experiment, we used the fol-
lowing two-sided soft thresholding encoding function: accuracy on the TIMIT dataset. By following (Ngiam
et al., 2011), we generated 39-dimensional MFCC fea-
fj = max max(wjT Ts v − α, 0)
s
tures and used 11 contiguous frames of them as an
input patch. For TIRBMs, we applied three temporal
fj+K = max max(−wjT Ts v − α, 0) ,
s translations with the stride of 1 frame.
where α is a constant threshold that was cross- First, we compared the classification accuracy using
validated. As a result, we observed about 1% im- the linear SVM to evaluate the performance gain com-
provement over the baseline method (OMP-1/T) using ing from the unsupervised learning algorithms, by fol-
1,600 filters, which supports the argument that our lowing the experimental setup in (Ngiam et al., 2011).8
transformation-invariant feature learning framework As reported in Table 4, the TIRBM showed an im-
can be effectively transferred to other unsupervised provement over the sparse RBM, as well as the sparse
learning methods. Finally, by increasing the number of coding and sparse filtering.
filters (K= 4,000), we obtained better results (82.2%) In the next setting, we used the RBF-kernel SVM
than the previously published results using single-layer (Chang & Lin, 2011) on the extracted features that
models, as well as those using deep networks. are concatenated with MFCC features. We used de-
We also performed the object classification task on fault RBF kernel width for all experiments and per-
STL-10 dataset (Coates et al., 2011), which is more formed cross-validation on the C values. As shown in
challenging due to the smaller number of labeled train- Table 5, combining MFCC with TIRBM features was
ing examples (100 per class for each training fold). the most effective and resulted in 1% improvement in
Since the original images are 96×96 pixels, we down- classification accuracy over the baseline MFCC fea-
sampled the images into 32×32 pixels, while keeping tures. By increasing the number of TIRBM features
the RGB channels. We followed the same unsuper- to 512, we were able to beat the best published results
vised training and classification pipeline as we did for on the TIMIT phone classification tasks using hierar-
CIFAR-10. As reported in Table 3, there were con- chical LM-GMM classifier (Chang & Glass, 2007).
sistent improvements in classification accuracy by in-
corporating the various transformations in learning al- 6. Conclusion and Future Work
gorithms. Finally, we achieved 58.7% accuracy using In this work, we proposed novel feature learning algo-
1,600 filters, which is competitive to the best published rithms that can achieve invariance to the set of pre-
single layer result (59.0%). defined transformations. Our method can handle gen-
eral transformations (e.g., translation, rotation, and
5.4. Phone classification
8
We used K = 256 with a two-sided encoding func-
To show the broad applicability of our method to other tion by using the positive and negative weight matrices
data types, we report the 39-way phone classification [W, −W], as suggested by Ngiam et al. (2011).
Learning Invariant Representations with Local Transformations
scaling), and we experimentally showed that learn- Kavukcuoglu, K., Sermanet, P., Boureau, Y. L., Gregor,
ing invariant features for such transformations leads K., Mathieu, M., and LeCun, Y. Learning convolutional
to strong classification performance. In future work, feature hierachies for visual recognition. In NIPS, 2010.
we plan to work on learning transformations from the Kivinen, J. and Williams, C. Transformation equivariant
data and combine it with our algorithm. By automat- boltzmann machines. In ICANN, 2011.
ically learning transformation matrices from the data, Krizhevsky, A. Learning multiple layers of features from
we will be able to learn more robust features, which Tiny Images. Master’s thesis, University of Toronto,
will potentially lead to significantly better feature rep- 2009.
resentations. Krizhevsky, A. Convolutional deep belief networks on cifar-
Acknowledgments 10. Technical report, 2010.
This work was supported in part by a Google Faculty Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and
Bengio, Y. An empirical evaluation of deep architectures
Research Award.
on problems with many factors of variation. In ICML,
2007.
References
LeCun, Y., Boser, B., Denker, J. S., Henderson, D.,
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H.
Howard, R. E., Hubbard, W., and Jackel, L. D. Back-
Greedy layer-wise training of deep networks. In NIPS,
propagation applied to handwritten zip code recogni-
2007.
tion. Neural Computation, 1:541–551, 1989.
Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A li- Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief
brary for support vector machines. ACM Transactions network model for visual area V2. In NIPS, 2008.
on Intelligent Systems and Technology, 2:27:1–27:27,
2011. http://www.csie.ntu.edu.tw/~cjlin/libsvm. Lee, H., Largman, Y., Pham, P., and Ng, A. Y. Unsu-
pervised feature learning for audio classification using
Chang, H. and Glass, J. R. Hierarchical large-margin gaus- convolutional deep belief networks. In NIPS. 2009.
sian mixture models for phonetic classication. In ASRU,
2007. Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Un-
supervised learning of hierarchical representations with
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., convolutional deep belief networks. Communications of
and Schmidhuber, J. High-performance neural networks the ACM, 54(10):95–103, 2011.
for visual object classification. Technical report, 2011. Lowe, D. Object recognition from local scale-invariant fea-
tures. In ICCV, 1999.
Coates, A. and Ng, A. Y. The importance of encoding ver-
sus training with sparse coding and vector quantization. Ngiam, J., Koh, P. W., Chen, Z., Bhaskar, S., and Ng,
In ICML, 2011a. A. Y. Sparse filtering. In NIPS, 2011.
Coates, A. and Ng, A. Y. Selecting receptive fields in deep Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Ef-
networks. In NIPS, 2011b. ficient learning of sparse representations with an energy-
based model. In NIPS, 2006.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single- Sohn, K., Jung, D. Y., Lee, H., and Hero III, A. Efficient
layer networks in unsupervised feature learning. In AIS- learning of sparse, distributed, convolutional feature rep-
TATS, 2011. resentations for object recognition. In ICCV, 2011.
Hinton, G. E. Training products of experts by minimiz- van Hateren, J. H. and van der Schaaf, A. Independent
ing contrastive divergence. Neural Computation, 14(8): component filters of natural images compared with sim-
1771–1800, 2002. ple cells in primary visual cortex. Proceedings of the
Royal Society B, 265:359–366, 1998.
Hinton, G. E. and Salakhutdinov, R. Reducing the di-
mensionality of data with neural networks. Science, 313 Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A.
(5786):504–507, 2006. Extracting and composing robust features with denois-
ing autoencoders. In ICML, 2008.
Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and
algorithm for deep belief nets. Neural Computation, 18 Manzagol, P. A. Stacked denoising autoencoders: Learn-
(7):1527–1554, 2006. ing useful representations in a deep network with a local
denoising criterion. JMLR, 11:3371–3408, 2010.
Hyvärinen, A., Hoyer, P. O., and Inki, M. O. Topographic
independent component analysis. Neural Computation, Yu, K., Lin, Y., and Lafferty, J. Learning image representa-
13(7):1527–1558, 2001. tions from the pixel level via hierarchical sparse coding.
In CVPR, 2011.
Kavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun, Y.
Learning invariant features through topographic filter Zeiler, M. D., Krishnan, D., Taylor, G. W., and Fergus, R.
maps. In CVPR, 2009. Deconvolutional networks. In CVPR, 2010.