– We introduce feature learning from FV classification, including all back-propagation rules to update the feature representation.
– The proposed method includes a re-formulation of supervised GMM parameter learning which satisfies the GMM constraints naturally using batches.

The remainder of this paper is organized as follows: After delimiting our work from related methods in Section 2, we describe the Fisher-Vector encoding that is used for learning features in Section 3. Section 4 contains details of the back-propagation pass and the learning environment. We evaluate the proposed method in Section 5 and provide concluding remarks in Section 6.
2. Related Work

Re-using activation values of NN layers as mid-level features for training additional classifiers in combination with Fisher Vectors and VLAD has become popular in recent work. Even for large-scale problems, methods like sparse Fisher-Vector coding [11] yield state-of-the-art results in generic object recognition and scene classification [5]. Thereby, computing a FV is usually done as an independent step, decoupled from the feature-learning process. A kind of stacking of Fisher-Vector layers in deep neural networks was applied to the ILSVRC-2010 challenge with impressive results in [20]. Still, their training method is a greedy layer-by-layer procedure without back-propagation steps, which disconnects the FV from the already trained layers in the network. Another work, considering Fisher Vectors as a pre-processing step followed by dimensionality reduction and a multi-layer perceptron, was recently discussed in [16], again without back-propagating updates. In addition, there is no reason why mid-level features extracted from a neural network should be the best choice in a completely different classification method. The goal of sharing gradient information between the classifier and the Fisher-Vector parameters was first addressed in [21] by adapting the underlying GMM parameters using the classifier loss information. Their algorithm samples GMM parameter gradients from all inputs. To guarantee a decreasing loss, they determine the optimal update by line search. From the computational aspect, their approach is limited to learning the Fisher kernel only, due to the enormous amount of gradient data that has to be calculated.

Armed with the knowledge of the classifier loss, one might ask how the points of the data set should move to allow the classifier to better separate the classes. Common approaches do not incorporate this information shared between the feature-learning stage of the original feature representation and the classifier. Although it is not possible to shift data points directly, this mapping can be approximated by training fW(·) in the feature-learning stage. This gradient propagation backwards through the FV layer closes the gap between the current one-directional usage of FV and deep neural networks. However, any back-propagation through the FV layer ends up in a non-trivial mapping into more than half a million elements (a concrete count follows below). Currently, we are only aware of batch-wise methods to tackle this issue.

Using current methods comes along with several additional drawbacks, like the requirement of multiple independent pipelines, a limited solution space when optimizing (1), and the discarding of relevant information such as ∇_θ from the classifier.

Applying these methods to normalized inputs falls into the class of learning methods where an SVM seeks the optimal separating hyperplane, reducing the theoretical upper bound on the expected risk R_ex, which depends on the radius of the sphere containing all data points (see Theorem 2.1 in [23]). Our work builds on previous attempts without violating this theorem, thereby retaining a strong theoretical basis.

Overall, it is preferable to fully embed FV within deep architectures to enable end-to-end learning without decoupled stages, as illustrated in Figure 1, optimizing the complete set of parameters simultaneously.
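To make the size of that mapping concrete: a FV under a GMM with K components over D-dimensional descriptors has K(1 + 2D) entries, one weight derivative plus D mean and D variance derivatives per Gaussian. The concrete values of K and D below are illustrative choices of ours, not the paper's configuration:

```latex
% FV dimensionality; K and D here are illustrative values, not the paper's:
\dim F^{X}_{\lambda} = K\,(1 + 2D), \qquad
K = 512,\; D = 512 \;\Rightarrow\; 512\,(1 + 2 \cdot 512) = 524\,800 \approx 5 \times 10^{5}.
```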
Figure 1: Difference between the traditional combination of deep neural networks (left) with fully-connected layers (fc), non-linearities (nl), normalization (norm) and FV, compared to our proposed end-to-end architecture (right). Our approach fully integrates the GMM and FV layer, enabling end-to-end learning with batched back-propagation.
3. Background

Before deriving the update rules for the Fisher-Vector layer, we shortly recap the analysis and motivation of Fisher Vectors to be self-contained. A comprehensive essay with more details can be found in [22].

Fisher Vectors. Let X be a set of features

X = {x_1, x_2, . . . , x_T},  x_j ∈ R^D,

for a single image. In computer vision tasks, X typically represents local image descriptors, e.g., 128-dimensional SIFT features. We denote by c_λ(·) the probability density function which corresponds to the observations X. The observation X implies a score function

G^X_λ = ∇_λ ln c_λ(X),

whose dimension depends only on the number of parameters, not on the size of the sample. Let F_λ be the Fisher information matrix of c_λ; a natural kernel (see [8]) for measuring the similarity between two samples X, Y is given by

K(X, Y) = (G^X_λ)^T F_λ^{-1} G^Y_λ,  where  F_λ = E_{x∼c_λ}[ ∇_λ ln c_λ(x) (∇_λ ln c_λ(x))^T ].

Learning a classifier with the kernel K(X, Y) is equivalent to learning a linear classifier on the Fisher Vectors F^X_λ = L_λ G^X_λ, where F_λ^{-1} = L_λ^T L_λ is the Cholesky decomposition of F_λ^{-1}. Although a natural choice for c_λ is a Gaussian-Mixture-Model (GMM), a weighted sum of K Gaussians with weights λ_1, . . . , λ_K, means µ_1, . . . , µ_K and covariances Σ_1, . . . , Σ_K, an alternative is a hybrid Gaussian-Laplacian [10].
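For the diagonal-covariance GMM used below, the resulting FV has a well-known closed form [22]. The following is a minimal NumPy sketch of that encoding; the function and variable names are our own illustration, not the paper's implementation:

```python
import numpy as np

def fisher_vector(X, lam, mu, var):
    """Fisher Vector of one image under a diagonal-covariance GMM.

    X   : (T, D) array of local descriptors
    lam : (K,)   mixture weights (sum to 1)
    mu  : (K, D) component means
    var : (K, D) diagonal covariances sigma_k^2
    Returns a vector with K * (1 + 2 * D) entries.
    """
    T, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                      # (T, K, D)

    # log-densities of every component at every descriptor
    log_g = -0.5 * (D * np.log(2.0 * np.pi)
                    + np.log(var).sum(axis=1)
                    + (diff ** 2 / var[None]).sum(axis=2))     # (T, K)

    # posteriors gamma_{tk} via a numerically stable softmax
    log_w = np.log(lam) + log_g
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # (T, K)

    # normalized score statistics w.r.t. weights, means and variances
    u = diff / np.sqrt(var)[None]                              # standardized residuals
    g_lam = (gamma.sum(axis=0) - T * lam) / np.sqrt(lam)       # (K,)
    g_mu = (gamma[..., None] * u).sum(axis=0) / np.sqrt(lam)[:, None]
    g_var = (gamma[..., None] * (u ** 2 - 1.0)).sum(axis=0) / np.sqrt(2.0 * lam)[:, None]

    return np.concatenate([g_lam, g_mu.ravel(), g_var.ravel()]) / T
```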
Assuming diagonal covariance matrices Σ_k = diag(σ_k²), 0 < σ_k² ∈ R^D, for efficient computations, learning a GMM amounts to fitting (λ_j, µ_j, σ_j²)_{j=1,...,K}. Since the FV entries in (4) are differentiable in these parameters, they can be optimized using the gradients ∂/∂x ln c_j, ∂/∂µ ln c_j, ∂/∂σ² ln c_j and the derivatives of the FV.¹
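These log-density gradients have short closed forms for a diagonal GMM; the sketch below (our own illustrative code, not the paper's) evaluates them for a single descriptor:

```python
import numpy as np

def gmm_log_density_grads(x, lam, mu, var):
    """Gradients of ln c_lambda(x) for a diagonal-covariance GMM.

    x   : (D,)   a single descriptor
    lam : (K,)   mixture weights
    mu  : (K, D) means
    var : (K, D) diagonal covariances sigma_k^2
    """
    D = x.shape[0]
    diff = x - mu                                               # (K, D)
    log_g = -0.5 * (D * np.log(2.0 * np.pi)
                    + np.log(var).sum(axis=1)
                    + (diff ** 2 / var).sum(axis=1))            # (K,)

    # posteriors gamma_k = lam_k g_k(x) / c(x)
    w = lam * np.exp(log_g - log_g.max())
    gamma = w / w.sum()                                         # (K,)

    d_x = -(gamma[:, None] * diff / var).sum(axis=0)            # d ln c / d x
    d_mu = gamma[:, None] * diff / var                          # d ln c / d mu_k
    d_var = 0.5 * gamma[:, None] * (diff ** 2 / var ** 2 - 1.0 / var)
    d_lam = gamma / lam                                         # unconstrained weight gradient
    return d_x, d_lam, d_mu, d_var
```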
Re-parameterizing the GMM parameters in terms of some ν_j, ζ_j ∈ R and ε > 0 allows us to optimize objective (1) in an unconstrained setting in λ, µ, σ² and x, and provides a natural way to satisfy all GMM parameter constraints, σ_k² > ε, λ_j ∈ [0, 1], Σ_ℓ λ_ℓ = 1, without numerical issues or projection steps during gradient descent.
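The exact re-parameterization in ν and ζ is not contained in this excerpt; a standard choice that meets the stated constraints is a softmax over ζ for the weights and a shifted exponential in ν for the variances. A sketch under that assumption:

```python
import numpy as np

EPS = 1e-4  # the constraint threshold epsilon > 0 (value is our choice)

def gmm_params(zeta, nu):
    """Map unconstrained (zeta, nu) to valid GMM weights and variances.

    zeta : (K,)   free parameters for the mixture weights
    nu   : (K, D) free parameters for the diagonal covariances
    """
    z = np.exp(zeta - zeta.max())
    lam = z / z.sum()            # softmax: lam_j in [0, 1], sum_j lam_j = 1
    var = EPS + np.exp(nu)       # sigma_k^2 > EPS, component-wise
    return lam, var
```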
Figure: 2D data set using hinge loss. There, we shifted each input following the FV gradient. Notice that Eq. (9) has a connection to inverse-variance weighting.

For a batch of features, the computation times of the FV derivatives and the resulting speed-ups are:

derivative          non-batched    batched    speed-up
∂F/∂(λ, µ, σ²)      18.9 h         10.79 s    ×6306
∂F/∂x               1.9 h          2.89 s     ×2367
sum                 19.1 h         13.71 s    ×5015

¹ Consult our prototype implementation of symbolic derivatives in the supplemental material.
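Speed-ups of this magnitude come largely from vectorizing the computation over the whole batch instead of looping over single descriptors; a toy demonstration of the effect (sizes and code are ours, not the paper's benchmark):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
T, K, D = 2048, 32, 64                     # toy sizes, not the paper's setup
X = rng.normal(size=(T, D))
mu = rng.normal(size=(K, D))
var = np.ones((K, D))

def posteriors(batch):
    """Soft assignments gamma for all descriptors in the batch at once."""
    diff = batch[:, None, :] - mu[None]                     # (T, K, D)
    log_g = -0.5 * ((diff ** 2 / var).sum(2) + np.log(var).sum(1))
    log_g -= log_g.max(axis=1, keepdims=True)
    g = np.exp(log_g)
    return g / g.sum(axis=1, keepdims=True)

t0 = time.perf_counter()
gamma_batched = posteriors(X)                                 # one vectorized call
t1 = time.perf_counter()
gamma_looped = np.stack([posteriors(x[None])[0] for x in X])  # per-descriptor loop
t2 = time.perf_counter()

assert np.allclose(gamma_batched, gamma_looped)
print(f"batched: {t1 - t0:.4f}s  looped: {t2 - t1:.4f}s")
```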