
Backpropagation Training for Fisher Vectors within Neural Networks

Patrick Wieschollek∗ Fabian Groh∗ Hendrik P.A. Lensch


University of Tübingen

Abstract

Fisher Vectors (FV) encode higher-order statistics of a set of multiple local descriptors such as SIFT features. They already show good performance in combination with shallow learning architectures on visual recognition tasks. Current methods using FV as a feature descriptor in deep architectures assume that all original input features are static. We propose a framework to jointly learn the representation of the original features, the FV parameters and the parameters of the classifier in the style of traditional neural networks. Our proof-of-concept implementation improves the performance of FV on the Pascal VOC 2007 challenge in a multi-GPU setting in comparison to a default SVM setting. We demonstrate that FV can be embedded into neural networks at arbitrary positions, allowing end-to-end training with back-propagation.

1. Introduction

Many fundamental computer-vision problems rely on extracting multiple meaningful local rigid descriptors like SIFT [12], GIST [13] or SURF [1] features from a single RGB image. Classifying a combination of these features with methods such as a Multi-Layer Perceptron (MLP) or a linear Support Vector Machine (SVM) often results in good performance [3].

Local features as a compressed image representation are an important factor in visual recognition tasks. Methods like deep neural networks (NN) are widely used to learn good feature representations without hand-engineering effort. Applied to datasets with large amounts of labeled examples they define state-of-the-art results in various computer vision tasks and even surpass human performance [4].

Enriched features, which additionally encode higher-order statistics of the underlying distribution, further improve the classification results. Prominent examples are VLAD [9] and Fisher Vectors (FV) [14], where the latter is based on a Gaussian Mixture Model (GMM) fitted to the data and encodes information relative to each Gaussian component.

The combination of (convolutional) NN and FV using shallow architectures recently became popular [24, 11, 5, 3, 2]. In a nutshell, these methods approach a minimization of the empirical risk $R_{emp}$, depending on static input features $x \in \mathbb{R}^D$, a feature transformation $f_W(\cdot)$ with parameters $W \in \mathbb{R}^{D\times D'}$, a kernel $\mathcal{F}$ with parameters $\xi \in \mathbb{R}^{D'}$ and classifier parameters $\theta \in \mathbb{R}^{D'}$, solving

$$(W^\star, \xi^\star, \theta^\star) := \arg\min_{W,\xi,\theta}\; R_{emp}\big(\mathcal{F}_\xi(f_W(x)), \theta\big). \qquad (1)$$

All previous methods perform a greedy optimization of these parameters one after another. Usually these steps are: training neural network parameters $\hat{W}$ for feature extraction on some loss function, learning GMM components $\hat{\xi}$ on these fixed features $f_{\hat{W}}(x)$ using expectation-maximization, and finally optimizing a linear SVM to obtain $\hat{\theta}$. Empirical results of this sequential approach already improve performance compared to NN. However, it seems reasonable to share information between these optimization steps in the fashion of deep neural networks and to jointly solve problem (1) in $W, \xi, \theta$.

Therefore, we propose a neural-network-like, batch-wise back-propagation training for optimizing $W, \xi, \theta$ together, with strong theoretical support from [23]. Unfortunately, directly tackling Eq. (1) in a joint optimization approach comes at the price of handling a huge amount of input data. A single image is described by $T$ SIFT features, $T > 8\cdot 10^4$. For the 5k images of the Pascal VOC 2007 data set this results in approx. 204GB, in contrast to a single 4128-dimensional FV per image. Fortunately, exploiting multi-GPU and sampling approaches makes it possible to compute all necessary parameter update rules in reasonable time.

Our main contributions in this paper can be summarized as follows:

– The proposed architecture includes FV, GMM, and normalization as neural network modules, which enables end-to-end learning in the fashion of classical neural networks.
– We provide the first multi-GPU accelerated implementation for FV computation in large-scale classification tasks.
∗ Indicates equal contribution. tasks.

– We introduce feature learning for FV classification by including all back-propagation rules needed to update the feature representation.
– The proposed method includes a re-formulation of supervised GMM parameter learning which satisfies the GMM constraints naturally when using batches.

The remainder of this paper is organized as follows: After delimiting our work from related methods in this field in Section 2, we describe the Fisher-Vector encoding that is used for learning features in Section 3. Section 4 contains details of the back-propagation pass and the learning environment. We evaluate the proposed method in Section 5 and provide concluding remarks in the final Section 6.

2. Related Work

Re-using activation values of NN layers as mid-level features for training additional classifiers in combination with Fisher Vectors and VLAD became popular in recent work. Even for large-scale problems, methods like sparse Fisher-Vector coding [11] yield state-of-the-art results in generic object recognition and scene classification [5]. Thereby, computing a FV is usually done as an independent step decoupled from the feature learning process. A kind of stacking of Fisher-Vector layers in deep neural networks was applied to the ILSVRC-2010 challenge with impressive results in [20]. Still, their training method is a greedy layer-by-layer training without back-propagation steps, which disconnects the FV from the already trained layers in the network. Another work, considering Fisher Vectors as a pre-processing step followed by dimensionality reduction methods and a multi-layer perceptron, was recently discussed in [16], again without back-propagating updates. In addition, there is no reason for mid-level features extracted from a neural network to be the best choice in a completely different classification method. The goal of sharing gradient information between the classifier and the Fisher-Vector parameters was first addressed in [21] by adapting the underlying GMM parameters from the classifier loss information. Their algorithm samples GMM parameter gradients from all inputs. To guarantee a decreasing loss they determine the optimal update by line-search. From the computational aspect their approach is limited to learning the Fisher kernel only, due to the enormous amount of gradient data which has to be calculated.

Armed with the knowledge of the classifier loss, one might ask how the points of the data set should move to allow the classifier to better separate the classes. Common approaches do not incorporate this information shared between the feature learning stage of the original feature representation and the classifier. Although it is not possible to shift data points directly, this mapping can be approximated by training $f_W(\cdot)$ in the feature learning stage. This gradient propagation backwards through the FV layer closes the gap between the current one-directional usage of FV and deep neural networks. However, any back-propagation through the FV layer ends up with a non-trivial mapping into more than half a million elements. Currently, we are only aware of batch-wise methods to tackle this issue.

Using current methods comes along with several additional drawbacks, like the requirement of multiple independent pipelines, limiting the solution space when optimizing (1), and discarding relevant information like $\nabla\theta$ from the classifier.

Applying these methods to normalized inputs falls into the class of learning methods where an SVM seeks the optimal separating hyperplane, reducing the theoretical upper bound for the expected risk $R_{ex}$, which depends on the radius of the sphere containing all data points (see Theorem 2.1 in [23]). Our work builds on previous attempts without violating this theorem and thereby obtains a strong theoretical basis.

Overall, it is preferable to fully embed FV within deep architectures to enable deep end-to-end learning approaches without decoupled stages, as illustrated in Figure 1, for optimizing the complete set of parameters simultaneously.

Figure 1: Difference between the traditional combination of deep neural networks (left) with fully-connected layers (fc), non-linearities (nl), normalization (norm) and FV compared to our proposed combination (right). Left: x → fc → nl → fc → nl → loss, with a copy x̂ fed into GMM → FV → norm → loss; information is shared only within the feature learning stage and within the classifier. Right: x → fc → nl → fc → nl → GMM → FV → norm → loss, with information shared across the feature learning stage and the classifier. Our approach fully integrates the GMM and FV layer, enabling end-to-end training with batched back-propagation.

3. Background

Before deriving the update rules for the Fisher-Vector layer, we shortly recap the analysis and motivation of Fisher Vectors to be self-contained. A comprehensive essay and more details can be found in [22].

Fisher Vectors. Let $\mathcal{X}$ be a set of features

$$\mathcal{X} = \{x_1, x_2, \ldots, x_T\}, \qquad x_j \in \mathbb{R}^D$$

for a single image. In computer vision tasks $\mathcal{X}$ typically represents local image descriptors, e.g., 128-dimensional SIFT features. We denote by $c_\lambda(\cdot)$ the probability density function which corresponds to the observations $\mathcal{X}$. The observation $\mathcal{X}$ implies a score function

$$G^{\mathcal{X}}_\lambda = \nabla_\lambda \ln c_\lambda(\mathcal{X}),$$

with dimension only depending on the number of parameters, not on the size of the sample. Let $F_\lambda$ be the Fisher information matrix of $c_\lambda$; a natural kernel (see [8]) for measuring the similarity between two samples $X, Y$ is given by

$$K(X, Y) = \big(G^{X}_\lambda\big)^{T} F_\lambda^{-1}\, G^{Y}_\lambda, \qquad F_\lambda = \mathbb{E}_{x\sim c_\lambda}\!\left[\nabla_\lambda \ln c_\lambda(x)\,\big(\nabla_\lambda \ln c_\lambda(x)\big)^{T}\right].$$

Learning a classifier with kernel $K(X, Y)$ is equivalent to learning a linear classifier on the Fisher Vectors
$\mathcal{F}^{X}_\lambda = L_\lambda G^{X}_\lambda$, where $F_\lambda^{-1} = L_\lambda^{T} L_\lambda$ is the Cholesky decomposition of $F_\lambda^{-1}$. Although a natural choice for $c_\lambda$ is a Gaussian Mixture Model (GMM), i.e., a weighted sum of $K$ Gaussians, an alternative is a hybrid Gaussian-Laplacian mixture [10].

Assuming diagonal covariance matrices $\Sigma_k = \operatorname{diag}(\sigma_k^2)$, $0 < \sigma_k^2 \in \mathbb{R}^D$ for efficient computation, learning a GMM with $K$ components gives $(2D+1)K$ tunable parameters

$$\lambda_1, \ldots, \lambda_K \in \mathbb{R}, \qquad \sigma_1, \ldots, \sigma_K \in \mathbb{R}^D, \qquad \mu_1, \ldots, \mu_K \in \mathbb{R}^D,$$

where $\lambda_k$ denotes the prior probability of the Gaussian $\mathcal{N}(\cdot\,; \mu_k, \Sigma_k)$. For practical purposes all necessary computations are done in the log-domain using the logarithm of the multivariate normal distribution. For a fixed $x_t$ we abbreviate $c_j := \mathcal{N}(x_t; \mu_j, \sigma_j^2)$. The soft assignment or posterior probability for an arbitrary but fixed $x_t \in \mathcal{X}$ and a GMM with $K$ components is given by

$$\gamma_k(x_t) = \frac{\lambda_k\,\mathcal{N}(x_t; \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \lambda_j\,\mathcal{N}(x_t; \mu_j, \sigma_j^2)} \in [0, 1],$$

which can be computed using the "logsumexp trick" to avoid explicit computation of the $c_j$.
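For illustration, the log-domain computation of these posteriors can be sketched as follows. This is a minimal NumPy sketch under the diagonal-covariance assumption above; the array shapes and function names are our own and it is not the MATLAB/CUDA implementation described in Section 5.

```python
import numpy as np
from scipy.special import logsumexp  # numerically stable log-sum-exp reduction

def log_gaussian_diag(x, mu, var):
    """Log-density of diagonal-covariance Gaussians.
    x: (T, D) features, mu/var: (K, D)  ->  (T, K) log-densities ln c_k(x_t)."""
    D = x.shape[1]
    diff = x[:, None, :] - mu[None, :, :]                       # (T, K, D)
    return -0.5 * (D * np.log(2.0 * np.pi)
                   + np.log(var).sum(axis=1)[None, :]
                   + (diff ** 2 / var[None, :, :]).sum(axis=2))

def posteriors(x, lam, mu, var):
    """Soft assignments gamma_k(x_t), computed entirely in the log-domain."""
    log_w = np.log(lam)[None, :] + log_gaussian_diag(x, mu, var)  # (T, K)
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))
```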
Computing

$$\mathcal{F}^{X}_{\lambda_k} = \frac{1}{\sqrt{T\lambda_k}}\sum_{t=1}^{T}\big(\gamma_k(x_t) - \lambda_k\big) \qquad (2)$$

$$\mathcal{F}^{X}_{\mu_k} = \frac{1}{\sqrt{T\lambda_k}}\sum_{t=1}^{T}\gamma_k(x_t)\left(\frac{x_t - \mu_k}{\sigma_k}\right) \qquad (3)$$

$$\mathcal{F}^{X}_{\sigma^2_k} = \frac{1}{\sqrt{T\lambda_k}}\sum_{t=1}^{T}\gamma_k(x_t)\,\frac{1}{\sqrt{2}}\left(\frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1\right) \qquad (4)$$

and concatenating these $K$ results $\mathcal{F}^{X}_{\lambda_k} \in \mathbb{R}$, $\mathcal{F}^{X}_{\mu_k}, \mathcal{F}^{X}_{\sigma^2_k} \in \mathbb{R}^D$ into a $(2D+1)K$-dimensional vector leads to the Fisher-Vector representation ([15, 17]):

$$\mathcal{F} := \big(\mathcal{F}^{X}_{\lambda_1}, \ldots, \mathcal{F}^{X}_{\lambda_K},\ \mathcal{F}^{X}_{\mu_1}, \ldots, \mathcal{F}^{X}_{\mu_K},\ \mathcal{F}^{X}_{\sigma^2_1}, \ldots, \mathcal{F}^{X}_{\sigma^2_K}\big),$$

which is usually classified by a shallow architecture, e.g., a linear SVM. All vector expressions should be understood component-wise. This scheme is illustrated in Figure 2.

Figure 2: The FV encodes the data distribution fitted to a GMM ($K$ Gaussians with parameters $\lambda_k, \mu_k, \Sigma_k$) in addition to the position of the feature: a feature set $x \in \mathbb{R}^{T\times D}$ is mapped to $\mathcal{F}(x) \in \mathbb{R}^{(2D+1)K}$.

Note that $\mathcal{F}$ has a fixed dimension independent of $T$. As described in [22], Eqs. (2), (3) and (4) can be computed efficiently using pre-computed terms $S^p_k = \sum_t \gamma_k(x_t)\,x_t^p$ for $x_t \in \mathcal{X}$, $p = 0, 1, 2$. In our prototype implementation we were able to exploit the massively parallel nature of this step, which consists of several reductions.
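Under the same assumptions as the previous sketch, Eqs. (2)–(4) reduce to a few reductions over the feature set. The following direct (non-optimized) NumPy sketch mirrors the normalization written above; the pre-computed $S^p_k$ terms are an optimization that is not shown here, and all names are our own.

```python
import numpy as np

def fisher_vector(x, lam, mu, var, gamma):
    """Encode a feature set x (T, D) into a (2D+1)K Fisher vector.
    gamma: posteriors (T, K); lam: (K,); mu, var: (K, D)."""
    T = x.shape[0]
    sigma = np.sqrt(var)
    norm = 1.0 / np.sqrt(T * lam)                                 # (K,)
    # Eq. (2): zeroth-order statistics
    f_lam = norm * (gamma.sum(axis=0) - T * lam)
    # Eq. (3): first-order statistics
    diff = (x[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (T, K, D)
    f_mu = norm[:, None] * np.einsum('tk,tkd->kd', gamma, diff)
    # Eq. (4): second-order statistics
    f_sig = norm[:, None] / np.sqrt(2.0) * np.einsum('tk,tkd->kd', gamma, diff ** 2 - 1.0)
    return np.concatenate([f_lam, f_mu.ravel(), f_sig.ravel()])   # length (2D+1)K
```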
Finally, a function composition of the power-normalization

$$x \mapsto \operatorname{sign}(x)\,|x|^{\alpha}, \qquad \alpha \in (0, 1] \qquad (5)$$

and the L2-normalization

$$x \mapsto x\,\|x\|_2^{-1} \qquad (6)$$

applied to $\mathcal{F}$ improves the performance [17] of the SVM+FV combination.

Backpropagation. A neural network is usually a composition of functions $F := F_n \circ F_{n-1} \circ \cdots \circ F_1$ represented as stacked layers:

$$F_j\colon \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}, \qquad X_{j-1} \mapsto X_j := F_j(W_j, X_{j-1}),$$

where $X_{j-1} \in \mathbb{R}^{d_1}$ is the input to the $j$-th layer and $W_j$ is a collection of tunable parameters. Given $X_0$, the objective of the training phase of a neural network is to minimize a loss function $L(x, F(x))$ w.r.t. $W_{j=1,\ldots,n}$, most often using stochastic gradient descent optimization.

Given the partial derivatives w.r.t. $X_j$ it is possible to compute the partial derivatives w.r.t. $X_{j-1}$ and $W_j$ using the chain rule

$$\frac{\partial F_j}{\partial W_j} = \frac{\partial}{\partial W} F_j(W_j, X_{j-1})\;\frac{\partial F_{j+1}}{\partial X_j} \qquad (7)$$

$$\frac{\partial F_j}{\partial X_{j-1}} = \frac{\partial}{\partial X} F_j(W_j, X_{j-1})\;\frac{\partial F_{j+1}}{\partial X_j} \qquad (8)$$

as described in [18].
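In this view each module only has to expose a forward map and the two products of Eqs. (7) and (8). The following minimal interface is our own illustration of this pattern (a sketch, not tied to any particular framework or to the layers used later):

```python
import numpy as np

class Layer:
    """Minimal module interface for batched back-propagation."""
    def forward(self, x):                    # X_{j-1} -> X_j
        raise NotImplementedError
    def backward(self, x, grad_out):         # dL/dX_j -> (dL/dX_{j-1}, dL/dW_j)
        raise NotImplementedError

class Affine(Layer):
    """Fully-connected layer as one example of Eqs. (7) and (8)."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):                    # x: (N, D_in)
        return x @ self.W + self.b
    def backward(self, x, grad_out):
        grad_x = grad_out @ self.W.T         # Eq. (8): gradient w.r.t. the input
        grad_W = x.T @ grad_out              # Eq. (7): gradient w.r.t. the weights
        grad_b = grad_out.sum(axis=0)
        return grad_x, (grad_W, grad_b)
```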
4. Method

In the following, we describe all necessary steps for the Fisher-Vector layer to integrate this layer into back-propagation training.

Fisher-Vector update rules: Since equations (2)–(4) are differentiable in $(\lambda_j, \mu_j, \sigma_j^2)_{j=1,\ldots,K}$, these parameters can be optimized using the gradients $\frac{\partial}{\partial\lambda}\ln c_j$, $\frac{\partial}{\partial\mu}\ln c_j$, $\frac{\partial}{\partial\sigma^2}\ln c_j$ and the derivatives of the FV w.r.t. $\ln c_j$ by exploiting the chain rule. Elementary calculation yields all derivatives of (2), (3), (4) w.r.t. the GMM parameters and $x$, namely the derivatives of $\mathcal{F}^X_{\lambda_k}$, $[\mathcal{F}^X_{\mu_k}]_d$ and $[\mathcal{F}^X_{\sigma^2_k}]_d$ with respect to $\lambda_s$, $[\mu_s]_e$, $[\sigma^2_s]_e$ and $[x]_e$ (see the appendix for the formulas; a prototype implementation of the symbolic derivatives is provided in the supplemental material). We use gradient descent to update the parameters in the backward step.

One part of the gradient for the feature learning stage is given by

$$\frac{\partial}{\partial [x]_e}\big[\mathcal{F}^X_{\mu_k}\big]_d = \frac{1}{\sqrt{\lambda_k}}\left(\frac{\partial \gamma_k(x_t)}{\partial [x]_e}\,[\alpha_k(x_t)]_d + \delta_{ed}\,\frac{\gamma_k(x_t)}{[\sigma_k]_d}\right),$$

where

$$\frac{\partial}{\partial x_i}\gamma_k = \gamma_k(x_i)\left(-\beta_k(x_i) + \sum_{n=1}^{K}\beta_n(x_i)\,\gamma_n(x_i)\right), \qquad (9)$$

$$\beta_k(x_i) := \frac{x_i - \mu_k}{\sigma_k^2} \qquad (10)$$

and the Kronecker delta $\delta_{ab} := (a{=}b)$. Therefore, any update will push the descriptor $x_i$ into the direction of the $k$-th Gaussian, weighted by the posterior $\gamma_k(x_i)$ and $\frac{1}{\sigma_k^2}$. The effect of this gradient $\frac{\partial}{\partial[x]_e}[\mathcal{F}_{\mu_k}]_d$ is illustrated in Figure 3 for a 2D dataset. There, we shifted each input following the FV gradient. Notice that Eq. (9) has a connection to inverse-variance weighting.

Figure 3: Although it is not possible to separate both classes using the Fisher kernel, learning a transformation of x facilitates the classification of points. Here, each data point x is shifted along the FV gradient (darker means later in the training process). After the optimization the points are clearly separable.
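For a single descriptor this part of the backward pass can be written compactly. The following NumPy sketch of Eqs. (9), (10) and the $\mathcal{F}_{\mu}$ gradient above is our own illustration (gamma_i denotes the pre-computed posterior row for one descriptor; names and shapes are assumptions):

```python
import numpy as np

def grad_gamma_wrt_x(x_i, gamma_i, mu, var):
    """Eqs. (9)/(10): d gamma_k / d x_i for all k, returned as (K, D)."""
    beta = (x_i[None, :] - mu) / var                     # (K, D), Eq. (10)
    avg = (gamma_i[:, None] * beta).sum(axis=0)          # sum_n gamma_n * beta_n
    return gamma_i[:, None] * (-beta + avg[None, :])     # (K, D)

def grad_fmu_wrt_x(x_i, gamma_i, lam, mu, var):
    """d [F_mu_k]_d / d [x_i]_e for one descriptor, returned as (K, D, D)."""
    K, D = mu.shape
    sigma = np.sqrt(var)
    alpha = (x_i[None, :] - mu) / sigma                  # (K, D)
    dgamma = grad_gamma_wrt_x(x_i, gamma_i, mu, var)     # (K, D), index e
    out = np.einsum('ke,kd->kde', dgamma, alpha)         # first term of the bracket
    out += np.eye(D)[None, :, :] * (gamma_i[:, None, None] / sigma[:, :, None])
    return out / np.sqrt(lam)[:, None, None]             # prefactor 1/sqrt(lambda_k)
```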


Satisfying the GMM parameter constraints: It is crucial not to violate the GMM parameter assumptions $\sigma_k^2, \lambda_k > 0$, $\sum_k \lambda_k = 1$ when applying the steepest descent rule. The authors of [21] proposed to re-normalize the weights $\lambda_{j=1,\ldots,K} \leftarrow \lambda_j\big(\sum_k \lambda_k\big)^{-1}$ in each iteration to satisfy the constraints on $\lambda_j$. Instead, we internally model each weight $\lambda_k$ via normalized sigmoid functions:

$$\lambda_j(\nu_j) = \frac{[1 + \exp(-\nu_j)]^{-1}}{\sum_{\ell}[1 + \exp(-\nu_\ell)]^{-1}} \in (0, 1) \qquad (11)$$

and represent $\sigma_k^2$ as

$$\sigma_k^2(\zeta_k) = \varepsilon + \exp(\zeta_k) > 0 \qquad (12)$$

for some $\nu_j, \zeta_k \in \mathbb{R}$ and $\varepsilon > 0$, which allows to optimize objective (1) in an unconstrained setting in $\lambda, \mu, \sigma^2, x$ and provides a natural way to satisfy all GMM parameter constraints $\sigma_k^2 > \varepsilon$, $\lambda_j \in [0, 1]$, $\sum_\ell \lambda_\ell = 1$, without numerical issues or projection steps during gradient descent.
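A minimal sketch of this reparameterization is given below (our own NumPy code; $\nu$ and $\zeta$ are the unconstrained variables of Eqs. (11) and (12), and the default value of $\varepsilon$ is an assumption, since the paper does not fix it):

```python
import numpy as np

def gmm_params_from_unconstrained(nu, zeta, eps=1e-4):
    """Map free variables to valid GMM weights and variances.
    nu:   (K,)   -> lam with lam_k in (0, 1) and sum(lam) == 1   (Eq. 11)
    zeta: (K, D) -> var with var > eps                           (Eq. 12)"""
    s = 1.0 / (1.0 + np.exp(-nu))   # sigmoid of each nu_j
    lam = s / s.sum()               # normalized: lies on the simplex by construction
    var = eps + np.exp(zeta)        # strictly positive variances
    return lam, var

# Gradient steps are taken on (nu, zeta); the constraints hold automatically,
# so no projection or re-normalization step is needed.
```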
Design of the SVM gradient information: Let $(x_i, y_i)_{i=1,\ldots,N}$ be some training data consisting of features $x_i \in \mathbb{R}^D$ with labels $y_i \in \{-1, +1\}$. The back-propagation process starts from the SVM layer, the standard C-SVM:

$$\min_{\theta, b}\; \frac{1}{2}\|\theta\|^2 + \frac{C}{N}\sum_{i=1}^{N}\max\{0,\, 1 - y_i(\langle\theta, x_i\rangle + b)\} \qquad (13)$$

with regularization constant $1/C$ and hinge loss. The update information from (13) is sparse and only forces changes after a long training period, whenever the rare event occurs that a batch contains a mis-classified point. Hence, using the non-differentiable hinge loss for back-propagation one ends up with a sub-gradient method. Although the quadratic hinge loss as in [21] is smoother, it also creates sparse gradients.

Switching to the quadratic loss $(1 - y_i(\langle\theta, x_i\rangle + b))^2$ for back-propagating dense gradient information would move the data points to the margin (see Figure 4), which is unsuitable, since these clusters cannot be represented by Gaussians with diagonal covariance matrices. We use $-y\theta^{T}$ to shift all data towards the correct "side" away from the decision boundary – detached from the actual SVM formulation for classifying these points.

Figure 4: The update effect of different loss (sub-)gradients in the SVM layer (panels: data set, using hinge loss, using quadratic loss, $-y\theta^{T}$). The hinge loss produces sparse gradients. On the other hand, a GMM with coordinate-independence assumption is not capable of representing the result of the quadratic loss. We propose to directly use $-y\theta^{T}$, which simply shifts each point in the correct direction away from the classification boundary.
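The difference between these backward signals is easy to state in code. The sketch below (our own NumPy illustration, with the constant factor C/N omitted) contrasts the sparse hinge sub-gradient with the dense signal $-y\theta^{T}$ that is passed backwards:

```python
import numpy as np

def hinge_subgrad_wrt_x(X, y, theta, b):
    """Sub-gradient of the hinge loss w.r.t. the inputs: zero for every point
    that is correctly classified outside the margin (sparse signal)."""
    margin = y * (X @ theta + b)                  # (N,)
    active = (margin < 1.0).astype(X.dtype)       # only margin violations contribute
    return -active[:, None] * y[:, None] * theta[None, :]

def dense_backward_signal(y, theta):
    """Dense signal -y_i * theta for every point, pushing each sample in the
    correct direction away from the decision boundary regardless of its margin."""
    return -y[:, None] * theta[None, :]
```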
Normalization layer: The post-processing of the FV, a function composition of (5) and (6), which we refer to as the normalization layer, can be expressed as

$$\varphi(x) = \frac{\operatorname{sign}(x)\,|x|^{\alpha}}{\big\|\operatorname{sign}(x)\,|x|^{\alpha}\big\|_2}. \qquad (14)$$

Its derivative $\frac{\partial F_j}{\partial X_{j-1}} = \varphi'(X_{j-1})\,\frac{\partial F_{j+1}}{\partial X_j}$ for $\alpha = 0.5$ is given as

$$\varphi'(x) = \frac{2}{\|x\|_1}\left(\mathbf{1}_D - \frac{\hat{x}\hat{x}^{T}}{\hat{x}^{T}\hat{x}}\right)\mathbb{1}_{x\neq 0} \qquad (15)$$

for $\hat{x} = \operatorname{sign}(x)\sqrt{|x|}$ and identity matrix $\mathbf{1}_D$.
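A minimal sketch of this layer for $\alpha = 0.5$ is given below. The forward pass follows Eqs. (5), (6) and (14) directly; the backward pass applies a vector–Jacobian product that we derived by hand from (14), so it is an illustrative sketch rather than the compact matrix form of Eq. (15) or the authors' implementation:

```python
import numpy as np

def norm_forward(x, alpha=0.5):
    """Power-normalization (5) followed by L2-normalization (6)."""
    xhat = np.sign(x) * np.abs(x) ** alpha
    return xhat / np.linalg.norm(xhat)

def norm_backward(x, grad_out):
    """Vector-Jacobian product of the normalization layer for alpha = 0.5."""
    xhat = np.sign(x) * np.sqrt(np.abs(x))
    n1 = np.abs(x).sum()                               # = ||xhat||_2^2 = ||x||_1
    proj = grad_out - xhat * (xhat @ grad_out) / n1    # remove component along xhat
    denom = 2.0 * np.sqrt(np.abs(x)) * np.sqrt(n1)
    return np.where(x != 0, proj / np.maximum(denom, 1e-12), 0.0)  # indicator 1_{x != 0}
```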
Initialization: The GMM is learned with k-means initialization and trained until convergence of the log-likelihood. Initializing the feature learning stage is the tricky part. When using a random transformation W of the PCA-projected SIFT vectors x, the algorithm suffers from a bad initialization. Therefore, we revert these transformations to have exactly the same starting input for the Fisher layer, by using features x̃ satisfying x = tanh(Wx̃ + b) ∈ [−1, +1] in our method. The parameter W is initialized using Xavier initialization [7].

5. Implementation and Evaluation

We now describe our prototype CPU/GPU-based implementation, the datasets used and the experimental evaluation of our approach. In our experiments we evaluate the effectiveness of training the additional parameters within W.

5.1. Real-World Dataset and Feature Extraction

The publicly available Pascal VOC 2007 (VOC07) dataset [6] comprises 20 classes of objects to be recognized, split into 5011 images for training and 4952 images for testing for each class.

As the initial fixed feature representation we use dense SIFT features extracted at multiple scales with OpenCV, normalized to [−1, 1]. Each of these 128-dimensional local features is projected onto the first 64 principal components, similar to [21], [17]. We observed no decrease in performance when representing each image by 10⁴ randomly selected features instead of approx. 9·10⁴ as in [21], which reduces the storage costs from 204GB to 25GB. Hence, in our approach each back-propagation epoch needs to compute more than 25GB of derivatives.

5.2. Implementation Details

The implemented architecture consists of multiple layers, which process the input features in batches of 24 images
for fast enough updates. Each image, described by T PCA-transformed 64-dimensional features, is used to compute a 4128-dimensional Fisher Vector, which is normalized as described in Section 3. The underlying GMM contains 32 Gaussians. We trained the SVM (13) using stochastic dual coordinate ascent (SDCA) with C := N, which gives superior performance compared to applying PEGASOS [19] in the primal.

All remaining parameter updates are done by SGD with learning rate η = 10⁻⁴. In contrast, [21] determines an optimal step size in each update.

5.3. GPU Implementation

The main challenge of a batch-wise end-to-end Fisher-Vector implementation is the highly non-linear mapping of each FV batch back to feature and GMM space. Furthermore, the computation of those derivatives has a high complexity of O(K²D²T). Thus, a fast GPU implementation is indispensable to train the classifier in a feasible amount of time.

Our approach is well suited to take advantage of GPU parallelism. There are two levels of granularity in our implementation. The first is processing the images in parallel; in this fashion, each set of features is assigned to one block of threads. The second is a block layout which is shaped like the derivatives themselves. Hence, it is possible to process all elements of a feature in parallel while sweeping over the complete set of features. This procedure ensures that global memory access is kept at a minimum, while all computations are performed with on-chip memory. Figure 5 shows a scheme of this approach. For the implementation we used CUDA and a setting of four NVIDIA Titan X GPUs. The limiting factors in our GPU approach are the number of available registers and the size of shared memory.

Figure 5: The feature set of each image is distributed to one GPU thread block. The GPU block computes the respective calculus for each feature in the set one after another.

Timings. For computing the FV and the respective derivatives, a comparison between our MATLAB and GPU implementations is shown in Table 1. The highly vectorized MATLAB code runs on an i5-2500 CPU at 3.30GHz. In this experiment, the performance is measured on one CPU core compared to one GPU. The batch size is 24 with 10⁴ features each. The reported GPU timings also include all necessary memory transfers between host and device. Furthermore, both variants compute dense vectors and have no criteria to omit data.

task               Matlab    GPU      speedup of GPU
F^X                9.06s     20ms     ×453
∂_{λ,μ,σ²} F^X     18.9h     10.79s   ×6306
∂_x F^X            1.9h      2.89s    ×2367
sum                19.1h     13.71s   ×5015

Table 1: Timing comparison of the vectorized MATLAB version and the GPU implementation for a single batch of 24 images.

We allow for concurrent kernel execution, which adapts better to the available resources. The performance for different batch sizes is illustrated in Figure 6, where we use all four GPUs as well as all four cores of the i5 CPU.

Figure 6: Timing (in seconds, versus the number of images per batch) for CPU, asynchronous GPU and synchronous GPU execution for a complete forward and backward batch computation of the Fisher layer. Smaller is better. All four GPUs are saturated for 24 images per batch.

In summary, the MATLAB implementation with four cores takes more than 4.7 hours to compute a complete forward and backward step of the Fisher layer for batches of size 24. Our multi-GPU implementation reduces the computation time to less than 5.2 seconds. Note that for the PASCAL VOC 2007 challenge a complete sweep over the training data comprises 208 batches of 24 images, which still results in about 18 minutes runtime per epoch.

5.4. Experimental Results

To ensure a fair evaluation we tested our method against the baseline of only training the SVM classifier. To compare our method against [21], we re-implemented their algorithm up to batch-wise updates. As evaluation metric we use the average precision (AP), corresponding to the area under the precision-recall curve.

In detail, our training process consists of different phases. First, the initial training (with at most 15 epochs, ≈ 3000 iterations) of the SVM was done. We observed convergence of the SVM within this phase for all 20 experiments, i.e., the duality gap is less than 0.01. This initial training already guarantees good performance, as reported in the first column of Table 2, where only θ was learned. Note that our initial training already significantly outperforms the reported baseline results from [21].
                       AP (when optimizing parameters)
class          θ       θ, λ_k, μ_k, σ_k²    θ, λ_k, μ_k, σ_k², (W, b)
aeroplane      74.1    75.2 (+1.1)          77.3 (+3.2)
bicycle        55.1    55.7 (+0.6)          59.5 (+4.4)
bird           39.6    40.8 (+1.4)          41.9 (+2.3)
boat           68.2    68.5 (+0.3)          69.8 (+1.6)
bottle         24.2    24.8 (+0.6)          25.0 (+0.8)
bus            56.4    56.9 (+0.5)          59.0 (+2.6)
car            74.5    74.9 (+0.4)          77.8 (+3.3)
cat            50.2    50.8 (+0.6)          53.8 (+3.6)
chair          49.9    50.6 (+0.7)          51.8 (+1.9)
cow            30.1    32.2 (+1.1)          34.3 (+4.2)
dining table   42.1    43.1 (+1.0)          44.5 (+2.4)
dog            34.3    35.0 (+0.7)          38.2 (+3.9)
horse          75.3    75.2 (-0.1)          76.8 (+1.5)
motorbike      55.1    55.3 (+0.2)          57.9 (+2.8)
person         81.1    81.3 (+0.2)          82.6 (+1.5)
pottedplant    23.6    24.7 (+1.4)          27.5 (+3.9)
sheep          37.7    38.9 (+1.2)          38.9 (+1.2)
sofa           48.9    49.8 (+0.9)          51.1 (+1.2)
train          75.6    75.9 (+0.3)          77.8 (+2.2)
tvmonitor      46.8    46.7 (-0.1)          49.5 (+2.7)
mAP            52.1    52.8 (+0.7)          54.7 (+2.6)

Table 2: Results on the Pascal VOC 2007 database when optimizing different parameter sets.

From this, we start to evaluate further training of the GMM parameters (λ_k, μ_k, σ_k²)_{k=1,...,K} and θ. Training the Fisher kernel alone slightly improves the previous performance, at the price of much higher computation time. The corresponding mean AP is comparable to [21]. Starting again from the strong initial training, we then trained all parameters of the GMM and SVM including the feature mapping of our linear layer with tanh activation. Allowing the first layer to update the parameters W and b makes the solution process more flexible, allowing tanh(Wx + b) to fit better to the GMM parameters.

Our approach of optimizing all parameters increases the gain of training only the GMM parameters by more than three times, from 52.8 (+0.7) to 54.7 (+2.6). Remarkably, the total computational effort increases only by a factor of 1.2 compared to the method of [21].

6. Conclusion and Outlook

We introduce feature learning in combination with Fisher Vectors in the fashion of neural networks, which paves the way for a wider range of applications for Fisher Vectors. Our interpretation of the Fisher kernel as a module with a forward and backward pass allows end-to-end training architectures to benefit from this data-distribution encoding scheme. Analogously to the huge impact of GPU implementations in deep learning methods, we expect further progress for GPU-accelerated FV implementations in combination with other methods.

We believe that this approach enables several future directions. One interesting idea might be the embedding of Fisher-Vector modules in deep neural networks at arbitrary positions. Extracted features from convolution filters can be pooled by applying the Fisher layer instead of a fully connected layer. We are planning to add our implementation to popular deep learning frameworks like Caffe or mxnet. Another interesting way of using Fisher Vectors is to combine our approach with the work of [20] to train stacked Fisher layers in deep architectures.

Currently, the success of FV depends on robust down-sampling approaches like PCA. A batch-wise replacement would facilitate a large-scale end-to-end pipeline including Fisher-Vector training. Solving challenges like pre-training a Gaussian mixture model in a batch-wise online mode would help to realize a fully embedded Fisher layout in a classical back-propagation training.
7. Appendix

The derivatives of the Fisher-Vector layer are given by:

$$\frac{\partial \mathcal{F}^X_{\lambda_k}}{\partial \lambda_s} = \sum_{t=1}^{T} \frac{\delta_{sk}(\gamma_k - \lambda_k) - 2\gamma_s\gamma_k}{2\lambda_s\sqrt{\lambda_k}}$$

$$\frac{\partial \mathcal{F}^X_{\mu_k}}{\partial \lambda_s} = \sum_{t=1}^{T} \frac{\gamma_k\,\alpha_k\,(\delta_{sk} - 2\gamma_s)}{2\lambda_s\sqrt{\lambda_k}}$$

$$\frac{\partial \mathcal{F}^X_{\sigma^2_k}}{\partial \lambda_s} = \sum_{t=1}^{T} \frac{\gamma_k\,(\alpha_k^2 - 1)(\delta_{ks} - 2\gamma_s)}{2\lambda_s\sqrt{2\lambda_k}}$$

$$\frac{\partial \mathcal{F}^X_{\lambda_k}}{\partial [\mu_s]_e} = \sum_{t=1}^{T} \frac{\gamma_k(\delta_{ks} - \gamma_s)\,[\alpha_s]_e}{[\sigma_s]_e\sqrt{\lambda_k}}$$

$$\frac{\partial \big[\mathcal{F}^X_{\mu_k}\big]_d}{\partial [\mu_s]_e} = \sum_{t=1}^{T} \frac{\gamma_k\big((\delta_{ks} - \gamma_s)[\alpha_s]_e[\alpha_k]_d - \delta_{sk}\delta_{ed}\big)}{[\sigma_s]_e\sqrt{\lambda_k}}$$

$$\frac{\partial \big[\mathcal{F}^X_{\sigma^2_k}\big]_d}{\partial [\mu_s]_e} = \sum_{t=1}^{T} \frac{\gamma_k[\alpha_s]_e\big((\delta_{ks} - \gamma_s)([\alpha_k^2]_d - 1) - 2\delta_{de}\delta_{sk}\big)}{[\sigma_s]_e\sqrt{2\lambda_k}}$$

$$\frac{\partial \mathcal{F}^X_{\lambda_k}}{\partial [\sigma^2_s]_e} = \sum_{t=1}^{T} \frac{\gamma_k(\delta_{ks} - \gamma_s)\big([\alpha^2_s]_e - 1\big)}{2[\sigma^2_s]_e\sqrt{\lambda_k}}$$

$$\frac{\partial \big[\mathcal{F}^X_{\mu_k}\big]_d}{\partial [\sigma^2_s]_e} = \sum_{t=1}^{T} \frac{\gamma_k[\alpha_k]_d\big((\delta_{sk} - \gamma_s)([\alpha^2_s]_e - 1) - \delta_{sk}\delta_{ed}\big)}{2[\sigma^2_s]_e\sqrt{\lambda_k}}$$

$$\frac{\partial \big[\mathcal{F}^X_{\sigma^2_k}\big]_d}{\partial [\sigma^2_s]_e} = \sum_{t=1}^{T} \frac{\gamma_k\big((\delta_{ks} - \gamma_s)([\alpha^2_s]_e - 1)\cdot([\alpha^2_k]_d - 1) - 2\delta_{ks}\delta_{ed}[\alpha^2_k]_d\big)}{2[\sigma^2_s]_e\sqrt{2\lambda_k}}$$

$$\frac{\partial \mathcal{F}^X_{\lambda_k}}{\partial [x]_e} = \frac{1}{\sqrt{\lambda_k}}\,\frac{\partial \gamma_k(x)}{\partial [x]_e}$$

$$\frac{\partial \big[\mathcal{F}^X_{\mu_k}\big]_d}{\partial [x]_e} = \frac{1}{\sqrt{\lambda_k}}\left(\frac{\partial \gamma_k}{\partial [x]_e}\,[\alpha_k]_d + \delta_{ed}\,\frac{\gamma_k(x)}{[\sigma_k]_d}\right)$$

$$\frac{\partial \big[\mathcal{F}^X_{\sigma^2_k}\big]_d}{\partial [x]_e} = \frac{1}{\sqrt{2\lambda_k}}\left(\frac{\partial \gamma_k(x)}{\partial [x]_e}\,\big[\alpha^2_k - 1\big]_d + 2\delta_{ed}\,\gamma_k(x)\,[\beta_k(x_t)]_d\right).$$

We abbreviate $\gamma_\ell := \gamma_\ell(x_t)$ and $\alpha_\ell := \alpha_\ell(x_t) = \frac{x_t - \mu_\ell}{\sigma_\ell}$ by dropping the argument. Note that most expressions are sums of dyadic products. Therefore, all gradients are built up from the gradients of the soft-assignment function $\gamma_k$, which can be pre-computed:

$$\frac{\partial \gamma_k}{\partial \lambda_s} = \gamma_k\left(\frac{\delta_{ks}}{\lambda_k} - \frac{\gamma_s}{\lambda_s}\right)$$

$$\frac{\partial \gamma_k}{\partial \mu_s} = \gamma_k(\delta_{ks} - \gamma_s)\,\frac{x - \mu_s}{\sigma_s^2}$$

$$\frac{\partial \gamma_k}{\partial \sigma_s^2} = \gamma_k(\delta_{ks} - \gamma_s)\left(\frac{(x - \mu_s)^2}{2\sigma_s^4} - \frac{1}{2\sigma_s^2}\right).$$
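Since all FV gradients are assembled from these soft-assignment gradients, a finite-difference comparison is a cheap sanity test for an implementation. The sketch below is our own illustration (NumPy, diagonal-covariance GMM, one descriptor), not part of the authors' code:

```python
import numpy as np

def gamma(x, lam, mu, var):
    """Posteriors for a single descriptor x (D,) under a diagonal GMM."""
    logp = (np.log(lam)
            - 0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1))
    w = np.exp(logp - logp.max())
    return w / w.sum()

def check_dgamma_dlambda(x, lam, mu, var, k, s, h=1e-6):
    """Compare the analytic d gamma_k / d lambda_s with central differences."""
    g = gamma(x, lam, mu, var)
    analytic = g[k] * ((k == s) / lam[k] - g[s] / lam[s])
    lp, lm = lam.copy(), lam.copy()
    lp[s] += h
    lm[s] -= h
    numeric = (gamma(x, lp, mu, var)[k] - gamma(x, lm, mu, var)[k]) / (2 * h)
    return analytic, numeric   # should agree closely on well-scaled data
```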
References

[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359, June 2008.
[2] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
[3] M. Cimpoi, S. Maji, and A. Vedaldi. Deep convolutional filter banks for texture recognition and segmentation. CoRR, abs/1411.6836, 2014.
[4] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. CoRR, abs/1202.2745, 2012.
[5] M. Dixit, S. Chen, D. Gao, N. Rasiwasia, and N. Vasconcelos. Scene classification with semantic Fisher vectors. June 2015.
[6] M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision, 88(2):303–338, June 2010.
[7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10). Society for Artificial Intelligence and Statistics, 2010.
[8] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, pages 487–493. MIT Press, 1998.
[9] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3304–3311, June 2010.
[10] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. CoRR, abs/1411.7399, 2014.
[11] L. Liu, C. Shen, L. Wang, A. van den Hengel, and C. Wang. Encoding high dimensional local features by sparse coding based Fisher vectors. In Advances in Neural Information Processing Systems 27, pages 1143–1151. Curran Associates, Inc., 2014.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
[13] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145–175, May 2001.
[14] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8, June 2007.
[15] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pages 1–8, June 2007.
[16] F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. June 2015.
[17] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pages 143–156, Berlin, Heidelberg, 2010. Springer-Verlag.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research, chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
[19] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 807–814, New York, NY, USA, 2007. ACM.
[20] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Fisher networks for large-scale image classification. In Advances in Neural Information Processing Systems 26, pages 163–171. Curran Associates, Inc., 2013.
[21] V. Sydorov, M. Sakurada, and C. H. Lampert. Deep Fisher kernels – end to end learning of the Fisher kernel GMM parameters. June 2014.
[22] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[23] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Comput., 12(9):2013–2036, Sept. 2000.
[24] D. Yoo, S. Park, J. Lee, and I. Kweon. Fisher kernel for deep neural activations. CoRR, abs/1412.1628, 2014.
