
Dirichlet-based Histogram Feature Transform for Image Classification

Takumi Kobayashi
National Institute of Advanced Industrial Science and Technology
Umezono 1-1-1, Tsukuba, Japan
takumi.kobayashi@aist.go.jp

Abstract

Histogram-based features have significantly contributed to the recent development of image classification, e.g., through SIFT local descriptors. In this paper, we propose a method to efficiently transform those histogram features for improving the classification performance. The (L1-normalized) histogram feature is regarded as a probability mass function, which is modeled by the Dirichlet distribution. Based on this probabilistic modeling, we induce the Dirichlet Fisher kernel for transforming the histogram feature vector. The method works on the individual histogram feature to enhance its discriminative power at a low computational cost. In addition, in the bag-of-feature (BoF) framework, the Dirichlet mixture model can be extended to a Gaussian mixture by transforming histogram-based local descriptors, e.g., SIFT, and thereby we propose the method of the Dirichlet-derived GMM Fisher kernel. In experiments on diverse image classification tasks, including recognition of subordinate objects and material textures, the proposed methods improve the performance of the histogram-based features and the BoF-based Fisher kernel, and are favorably competitive with the state of the art.

1. Introduction

Histogram-based features are fundamental and play a key role in recently developed methods, especially for image classification. For example, SIFT features are widely used as local descriptors [29], HOG features serve object detection [6], and the bag-of-word (BoW) methods, i.e., visual word histograms, have driven image classification over the last decade [5]. In the bag-of-feature (BoF) framework, the SIFT local descriptors are also fed into the Fisher kernel [36] to significantly improve the performance.

The histogram basically captures the statistical feature by counting certain types of symbols. Besides, it is also used to measure the significance of those symbols by voting uncountable weights. In SIFT/HOG, the histogram of gradient orientations is constructed by voting the weight derived from the gradient magnitude into the orientation bins. Soft weights are also effectively employed to describe (quantize) a continuous input space in the form of a histogram, such as in soft coding for BoW [40]. Such soft weighting degrades the countable nature of the histogram, whereas BoW in document texts is inherently a histogram of countable words. Thus, the histogram-based feature requires an appropriate transformation, such as normalization, to form an effective feature vector for image classification.

In order to transform the feature into a form favorable for classification, statistical methods such as PCA, ICA and LDA have been widely applied. Those methods produce a projection of the feature vectors into a (sub)space based on a statistical criterion which is specific to the objective task. On the other hand, various types of normalization are also applicable to histogram-based features for the feature transform. Those methods generally work on the features apart from the tasks. The L1 normalization x/‖x‖₁ provides histogram-based features with a probabilistic perspective; that is, the L1-normalized histogram is regarded as a probability mass function over the histogram bins. For general image features, the L2 normalization x/‖x‖₂ is one of the most frequently used feature transforms, and in recent years the L2-Hellinger normalization √x/‖√x‖₂ = √(x/‖x‖₁) has been intensively applied [1, 36].¹ In this work, we focus on the latter type of feature transform, which is generally applicable to histogram-based features, followed by the classification methods specific to the recognition tasks.

¹ The SIFT descriptor with L2-Hellinger normalization is known as RootSIFT [1].

The feature transform is also addressed in the form of a kernel function for kernel-based methods [37]. While the Gaussian kernel is generally applied to feature vectors, some kernels are specialized for histogram-based features, e.g., the χ² kernel [47] and the intersection kernel [23]. However, a kernel function inevitably requires kernel-based methods that are computationally expensive. Thus, the kernel feature map was proposed in [41, 31] to circumvent that issue by representing the (additive) kernel in an explicit linear feature form.

The mapping enables us to leverage the kernel's discriminative power in linear features, but it drastically augments the feature dimensionality to approximate the non-linear kernel function, requiring more computational cost compared to the original features.

In this paper, we propose a method to efficiently transform histogram-based features. From the probabilistic viewpoint, the (L1-normalized) histogram feature is regarded as a probability mass function, which can be modeled by the Dirichlet distribution. Based on this probabilistic model, we induce the Dirichlet Fisher kernel to transform the histogram-based features, enhancing their discriminative power without augmenting the feature dimensionality. Thus, the transformed features are favorably fed into a linear classifier. Moreover, the method does not require cumbersome parameter tuning; instead, the parameter is generally determined based on the statistics of the histogram features, e.g., natural SIFT statistics. Note that the proposed method is applicable to any type of histogram-based feature, including those of countable symbols, e.g., BoW, as well as those constructed by voting weights, e.g., SIFT, whereas the Pólya model [4] only accepts the former type of histograms.

On the other hand, in the BoF framework, plenty of histogram-based local descriptors, e.g., SIFT, are modeled by the Dirichlet mixture model. We extend it to the Gaussian mixture model via the feature transform and thereby propose the method of the Dirichlet-derived GMM Fisher kernel. The method produces effective image features at a low extra computational cost for generic image classification, improving the classification performance.

Our main contributions are 1) the Dirichlet Fisher kernel with its parameter determined by natural statistics of the target histogram features, such as natural SIFT statistics, 2) the Dirichlet-derived GMM Fisher kernel for generic image classification, and 3) thorough comparative experiments on a variety of image classification tasks.

2. Dirichlet Fisher kernel

We propose the Dirichlet Fisher kernel to transform a histogram feature into a discriminative (linear) form of the same dimensionality, which is suitable for a subsequent linear classifier.

2.1. Definition

The L1-normalized histogram feature x ∈ R^D (x ≥ 0, ‖x‖₁ = 1) is supposed to be drawn from the Dirichlet distribution, which is defined by

p(x; θ) = (1/B(θ)) Π_{i=1}^{D} x_i^{θ_i − 1},   (1)

where θ ∈ R^D_+ is the parameter vector and B is the beta function, B(θ) = Π_{i=1}^{D} Γ(θ_i) / Γ(θ_0), using the gamma function Γ, with θ_0 = Σ_{i=1}^{D} θ_i. Note that the Dirichlet distribution (1) is defined over the simplex domain of the discrete probability distribution x as shown in Fig. 4a. The Fisher kernel [15] of the Dirichlet distribution is given by the derivative of the log probability w.r.t. θ,

∇_θ log p(x; θ) = log(x) − {ψ(θ) − ψ(θ_0)1} = log(x) − µ_θ,   (2)

where 1 ∈ R^D is the vector whose components are all 1, ψ is the digamma function (Fig. 1a), and the functions log and ψ act component-wise on the input vector. The Fisher information matrix is then written as

H = ∫ p(x; θ) ∇_θ log p(x; θ) ∇_θ^⊤ log p(x; θ) dx = diag{ψ′(θ)} − ψ′(θ_0) 11^⊤,   (3)

where ψ′ indicates the trigamma function, the first derivative of ψ (Fig. 1b), and diag(·) produces the diagonal matrix from the input vector. Thus, the Dirichlet Fisher kernel is obtained in the following linear feature form;

H^{−1/2} ∇_θ log p(x; θ) = H^{−1/2} [log(x) − µ_θ].   (4)

It should be noted that the Dirichlet Fisher kernel results in the same dimensionality D as the original feature x.

[Figure 1. Digamma and trigamma functions: (a) digamma ψ, (b) trigamma ψ′.]

2.2. Diagonal approximation

In the Dirichlet Fisher kernel (4), it is computationally expensive to apply the (inverse of the) full Fisher information matrix H. Therefore, we approximate H by the following diagonal form;

H ≈ diag{ψ′(θ) − ψ′(θ_0)1} = diag(σ_θ²).   (5)

Since the trigamma function ψ′(θ_0) approaches zero for larger θ_0 as shown in Fig. 1b, ψ′(θ_0) is negligible, especially in the case of large dimension D, and the Fisher information matrix H is dominated by the diagonal components σ_θ² = ψ′(θ) − ψ′(θ_0)1 > 0. Thus, (4) reduces to

diag(σ_θ)^{−1} [log(x) − µ_θ],   (6)

which consists only of component-wise operations.
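For illustration only, the following is a minimal sketch (ours, not the authors' released code) of how the component-wise form (6) can be computed with NumPy/SciPy for a given parameter vector θ; the function name and the strictly positive input are assumptions.

import numpy as np
from scipy.special import digamma, polygamma

def dirichlet_fisher_model(x, theta):
    # Component-wise Dirichlet Fisher kernel of Eq. (6) for a given parameter vector theta.
    # x: (D,) L1-normalized histogram, assumed strictly positive here
    #    (zero bins are handled by the modified log of Sec. 2.4).
    theta0 = theta.sum()
    mu = digamma(theta) - digamma(theta0)              # Eq. (2): mu_theta
    var = polygamma(1, theta) - polygamma(1, theta0)   # Eq. (5): sigma_theta^2 (trigamma)
    return (np.log(x) - mu) / np.sqrt(var)             # Eq. (6)

As shown next in Sec. 2.3, µ_θ and σ_θ can be replaced by empirical statistics of log(x), so θ itself never has to be estimated.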

2.3. Empirical approximation

In the above formulations (4, 6), the parameter θ could be estimated from the training samples by the EM method [32]. However, the parameter θ fortunately appears only in the forms of µ_θ and σ_θ, which are empirically estimated as the mean and standard deviation of log(x) as follows.

Since E_x[∇_θ log p(x; θ)] = ∫ p(x; θ) ∇_θ log p(x; θ) dx = 0, the mean of log(x) is given by

E_x[log(x)] = ψ(θ) − ψ(Σ_i θ_i)1 = µ_θ,   (7)

and then the diagonal Fisher information matrix results in the variance of log(x) as

σ_{θi}² = H_{ii} = ∫ p(x; θ) {log(x_i) − E_x[log(x_i)]}² dx = Var_x[log(x_i)].   (8)

The mean and variance of log(x) are empirically estimated from the training samples {x_j}_{j=1}^{N} by

µ̂ = Ê_x[log(x)] = (1/N) Σ_{j=1}^{N} log(x_j),   (9)

σ̂_i² = V̂ar_x[log(x_i)] = (1/N) Σ_{j=1}^{N} {log(x_{ji}) − µ̂_i}²,   (10)

where x_{ji} denotes the i-th component of the j-th sample vector x_j. These statistics are efficiently and stably computed, in contrast to the model parameter estimation for θ by the EM method. The Dirichlet Fisher kernel is finally obtained via these empirical approximations (9, 10) as

diag(σ̂)^{−1} {log(x) − µ̂}.   (11)

This resultant formulation is quite simple and efficient, without augmenting the feature dimensionality, unlike the kernel map [41, 31]. It might also be regarded as standardization of log(x), not of x. It should be noted that this standardization is theoretically derived from the Fisher kernel of the Dirichlet model.

2.4. Modified log-function

The "log" function plays an important role in the Dirichlet Fisher kernel (11). It, however, causes a numerical issue at zero, approaching −∞ as x → 0, and some histogram bins do practically end up empty. To prevent this, log(x) is usually modified to log(x + ε) using a small fraction parameter ε ≪ 1. In this section, we address the effect of the fraction ε from the viewpoint of the probability distribution over the logarithm of the variable, which has not been intensively discussed so far; in particular, we mention the natural SIFT statistics.

[Figure 2. Empirical distribution of transformed SIFT features marginalized over feature components: (a) log(x), (b) log(x + ε). The gray region indicates the deviations (between min and max) across the datasets and the red solid curve is the mean distribution.]

[Figure 3. Transform function: (a) modification of log by introducing ε, from log(x) to log(x + ε), and (b) log(x + ε) vs. x, with scaling into [0, 1].]

The marginal of the Dirichlet distribution is given by the Beta distribution;

p(x; θ) = x^{θ_i−1}(1−x)^{θ_0−θ_i−1} / B(θ_i, θ_0−θ_i) = x^{α−1}(1−x)^{β−1} / B(α, β)   (12)
        = p(x; α, β),   (13)

where we consider the marginal across all feature components under the assumption that the parameters θ_i do not change drastically over i. Then, we transform the variable x into v = log(x) to modify (13) into

p(v; α, β) = exp(αv) {1 − exp(v)}^{β−1} / B(α, β).   (14)

For example, Fig. 2a shows the empirical distributions of SIFT extracted from the various datasets used in Sec. 4.3, omitting the features of x = 0. We can find that the marginal distributions actually result in almost the same form of a unimodal (modified) Beta distribution regardless of the dataset sources. Thus, this is regarded as the natural SIFT statistics, similarly to the natural image statistics [20]. Based on this distribution, we give ε in a general form.²

² Here, we demonstrate the case of SIFT, but we empirically confirmed that it holds for the other types of histogram features.

Modifying log(x) into log(x + ε) corresponds to the variable transform ṽ = log{exp(v) + ε} (Fig. 3a), which accordingly reformulates the probability distribution into

p(ṽ; α, β) = {exp(ṽ) − ε}^{α−1} {1 + ε − exp(ṽ)}^{β−1} exp(ṽ) / B(α, β).   (15)
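For concreteness, the empirical transform (11) combined with the modified log can be sketched as follows. This is purely an illustrative reading of ours (the function names are assumptions, and pooling the nonzero values of all components to estimate the percentile is our interpretation of P(ε)); ε is taken from the cumulative percentile discussed in the remainder of this section, where the 25% percentile is suggested, giving ε ≈ 0.001 for SIFT.

import numpy as np

def fit_dirichlet_fisher(X, percentile=25.0):
    # Estimate epsilon (Sec. 2.4) and the statistics (9), (10) from training histograms X (N x D).
    eps = np.percentile(X[X > 0], percentile)   # epsilon = P^{-1}(0.25) on the nonzero marginal
    L = np.log(X + eps)
    return eps, L.mean(axis=0), L.std(axis=0)   # epsilon, mu-hat of (9), sigma-hat of (10)

def dirichlet_fisher_transform(x, eps, mu, sigma):
    # Transform one L1-normalized histogram by Eq. (11) with the modified log of Sec. 2.4.
    return (np.log(x + eps) - mu) / sigma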

The empirical distribution of the transformed SIFT is shown in Fig. 2b. The function log(x + ε) works in the following two aspects. First, it roughly rounds off log(x) for x < ε to log(ε), as shown in Fig. 3a. Tiny histogram values are regarded as zeros, exhibiting the negative evidence [16], while the differences among the moderately small x are enhanced like a local contrast enhancement (Fig. 3b). Second, by adequately determining ε, ṽ = log(x + ε) is smoothly distributed above log(ε) (the lower bound) while preserving the behavior around the mode of the distribution. Too small an ε pushes the lower bound log(ε) far away from the mode, which emphasizes the negative evidence too much. On the other hand, a larger ε unfavorably merges the negative evidence and the mode. The appropriate ε renders a smooth distribution between the mode and the lower bound (negative evidence). For that purpose, we determine ε based on the cumulative percentile of the empirical distribution; as a general setting, we suggest the 25% percentile, e.g., producing ε = P^{−1}(0.25) ≈ 0.001 for SIFT descriptors, where P(ε) = ∫_0^ε p(x) dx.

2.5. Discussion

The Dirichlet Fisher kernel is connected to tf-idf as follows. Roughly speaking, the digamma function ψ(θ) is approximated by log(θ) at large θ,³ which further provides the following approximations:

ψ(θ) − ψ(θ_0)1 ≈ log(θ) − log(θ_0)1 = log(θ/θ_0) = log(E_x[x]) = log(x̄),
ψ′(θ_i) − ψ′(θ_0) ≈ 1/θ_i − 1/θ_0 = (1/θ_0)(θ_0/θ_i − 1) = (1/θ_0)(1/x̄_i − 1),

where we use x̄ ≜ E_x[x] = θ/θ_0, the mean of x. Thus, the i-th component of the Dirichlet Fisher kernel (6) is approximately reformulated to

√{(1/θ_0)(1/x̄_i − 1)} log(x_i/x̄_i).   (16)

³ ψ(θ) is usually approximated by log(θ − 0.5) for θ ≥ 1. However, in order to shed light on the connection to tf-idf, we roughly approximate it by log(θ).

This is some sort of tf-idf [35]; apart from the constant 1/θ_0, the feature x_i is first normalized by its mean and then weighted by √(1/x̄_i − 1). Those weights emphasize the components of low means x̄_i, which correspond to rarely observed symbols. Such weighting, which prefers rare symbols over common ones, is motivated in the same way as tf-idf. Therefore, the Dirichlet Fisher kernel works similarly to tf-idf for enhancing the discriminativity of the features.

The Dirichlet Fisher kernel is also related to the Pólya Fisher kernel, which was first proposed in [7] for text categorization and then presented in [4] for visual classification. The Pólya model accounts for symbol (word) burstiness, which is measured discretely, by means of the compound of Dirichlet and multinomial models [30]. Thereby, it is suitable for transforming histograms composed of countable symbols, e.g., BoW, but is inapplicable to those of uncountable voting weights, e.g., SIFT/HOG. In contrast, the proposed method deals with various types of features into which those countable/uncountable histograms are L1-normalized. In addition, the Pólya method [7, 4] inevitably requires learning hyper-parameters, which is computationally expensive in the case of large-scale, high-dimensional histogram features.

3. Extension to BoF-based Fisher kernel

In Sec. 2, we proposed the Dirichlet Fisher kernel, which is applied to individual histogram features. In this section, it is extended in the bag-of-feature framework to the BoF-based Fisher kernel, which has exhibited promising performance in image classification [36].

BoF represents an image by plenty of local descriptors extracted at various points (grid points) with various scales in the image. Suppose we employ histogram-based local descriptors, typically SIFT descriptors; their distribution is modeled by the Dirichlet distribution as described in the previous section.

3.1. Dirichlet mixture model (DMM)

It is straightforward to apply the Dirichlet mixture model (DMM) to describe the bag of local descriptors;

p(x; {θ_k}_{k=1}^{K}) = Σ_{k=1}^{K} ω_k (1/B(θ_k)) Π_{i=1}^{D} x_i^{θ_{ki} − 1},   (17)

where ω_k are the prior weights, Σ_k ω_k = 1, ω_k ≥ 0 ∀k. This modeling naturally derives the following DMM Fisher kernel by using (6),

G_k = (1/(N√ω_k)) Σ_i p(k|x_i) diag(σ_{θ_k})^{−1} {log(x_i) − µ_{θ_k}},   (18)

µ_{θ_k} = ψ(θ_k) − ψ(θ_{k0})1,   σ_{θ_k}² = ψ′(θ_k) − ψ′(θ_{k0})1.

Note that the k-th part of the DMM Fisher kernel, G_k, is a D-dimensional vector since the Dirichlet model contains only the parameters θ_k ∈ R^D.

3.2. Dirichlet-derived GMM

We rewrite the Dirichlet distribution (1) by using the essential form v = log(x) as

p(v; θ) = exp{θ^⊤ v} / B(θ) ≈ exp{θ^⊤ ṽ} / B(θ) = p(ṽ; θ),   (19)

where ṽ = log(x + ε).

[Figure 4. Feature transform: (a) input histogram feature x (on the simplex), (b) feature transformed by the modified log, log(x+ε) − log(ε)1, and (c) feature transformed to (log(x+ε) − log(ε)1)/‖log(x+ε) − log(ε)1‖₂ so as to be suitable for GMM. Pseudo-colors show the correspondence among the three feature spaces. This figure is best viewed in color.]

Since x lies on the simplex in R^D (Fig. 4a), its transform ṽ is distributed on the convex surface (Σ_i e^{ṽ_i} = 1) shown in Fig. 4b. We roughly approximate the convex surface by the sphere surface in the positive orthant (Fig. 4c). These two surfaces are homeomorphic and, furthermore, similarly convex. This approximation leads to

p(ṽ; θ) ∝ exp(θ^⊤ {ṽ − log(ε)1})   (20)
        ≈ exp(η^⊤ (ṽ − log(ε)1) / ‖ṽ − log(ε)1‖₂),   (21)

where η is the parameter vector that scales θ to compensate for the difference in the radii of the two convex surfaces. (21) implies the von Mises-Fisher distribution [22] w.r.t. z = (ṽ − log(ε)1)/‖ṽ − log(ε)1‖₂, which is often replaced with the Gaussian distribution [12].⁴ As a result, we insist that a GMM is applicable to model the bag of transformed local descriptors L2-normalized as (log(x+ε) − log(ε)1)/‖log(x+ε) − log(ε)1‖₂, and consequently the Dirichlet-derived GMM Fisher kernel is proposed by applying the GMM Fisher kernel [36] to those transformed local descriptors.

⁴ Actually, although the SIFT descriptors [29] are L2-hysteresis normalized, forming spherical distributions, the GMM is directly applied to those distributions in the GMM Fisher kernel [36].

It is advantageous to model the distribution of local descriptors by a Gaussian (GMM) in the following respect: the von Mises-Fisher distribution (21), getting back to the Dirichlet distribution, is circularly symmetric around η and unable to characterize the anisotropic dispersity orthogonal to η, while the Gaussian model can exploit it through the variance; the GMM Fisher kernel [36] produces two types of features, related to the variance and the mean, which doubles the feature dimensionality of the DMM Fisher kernel (18). The DMM Fisher kernel (18) is similar to the proposed Fisher kernel derived from the mean,

(1/(N√ω_k)) Σ_i p(k|x_i) diag(σ_k)^{−1} [z_i − µ_k],   (22)

where ω_k, µ_k, σ_k are the GMM parameters estimated by applying the EM method to the transformed descriptors z = (log(x+ε) − log(ε)1)/‖log(x+ε) − log(ε)1‖₂. Without applying the L2 normalization in z, (22) reduces to the DMM Fisher kernel (18). Thus, the DMM Fisher kernel is regarded as a subset of the proposed Dirichlet-derived GMM Fisher kernel.

From the viewpoint of transforming the SIFT descriptors, our method is related to RootSIFT [1], which has also been applied to the Fisher kernel [18]. RootSIFT is the SIFT descriptor transformed by L2-Hellinger normalization, √(x/‖x‖₁). In RootSIFT, the deviations around the smaller feature values are enhanced by means of the square root, while the proposed transform enlarges them further (Fig. 3b), pushing the tiny values into the negative evidence as described in Sec. 2.4.

4. Experimental results

We evaluate the proposed methods⁵ on a variety of image classification tasks by applying the linear SVM classifier. Note that the fraction ε is determined based on the 25% percentile of the marginal feature distribution (Sec. 2.4).

⁵ Codes are available at https://staff.aist.go.jp/takumi.kobayashi/codes.html#DirFT

4.1. Joint histogram of gradient orientation

The proposed Dirichlet Fisher kernel (Sec. 2) is applied to transform the gradient local auto-correlation (GLAC) feature [19] for discriminating pedestrian images on the Daimler-Chrysler dataset [33]. The feature is computed as the joint (co-occurrence) histogram of local gradient orientations with uncountable voting weights, to which the Pólya model [4] is inapplicable.

Daimler-Chrysler pedestrian [33]. The task is to classify image patches of 18 × 36 pixels into pedestrian (positive) or clutter (negative). For details of the dataset and the evaluation protocol, refer to [33]. The GLAC feature with the setting in [19] is extracted from 2 × 4 spatial bins over the 18 × 36 image, resulting in 2,592 dimensions.

As in Sec. 2.4, the empirical feature distributions on the pedestrian and non-pedestrian sets result in almost the same distribution with small deviations, based on which the fraction ε is determined; ε = P^{−1}(0.25).

The performance results are shown in Fig. 5 in comparison with other types of feature transforms: L1 normalization x/‖x‖₁, L2 normalization x/‖x‖₂, L2-hysteresis normalization min[x/‖x‖₂, τ]/‖min[x/‖x‖₂, τ]‖₂ with τ = 1/√D,⁶ L2-Hellinger normalization √x/‖√x‖₂, standardization diag(σ_X)^{−1}(x − µ_X) with the mean µ_X and standard deviation σ_X estimated from the features x, and the explicit feature map of the χ² kernel proposed in [41]. Fig. 5 reports the performance measured by ROC and equal error rate (EER).

⁶ In the case of SIFT, τ = 0.2 as suggested in [29].
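As a reference-only sketch (ours, not taken from the released code; the function names are assumptions), the baseline normalizations compared above can be written as follows for a nonnegative feature x, with τ the hysteresis threshold (1/√D here).

import numpy as np

def l1_norm(x):
    return x / np.abs(x).sum()                 # x / ||x||_1

def l2_norm(x):
    return x / np.linalg.norm(x)               # x / ||x||_2

def l2_hellinger(x):
    return np.sqrt(x / np.abs(x).sum())        # sqrt(x) / ||sqrt(x)||_2 for x >= 0

def l2_hysteresis(x, tau):
    c = np.minimum(x / np.linalg.norm(x), tau) # clip the L2-normalized feature at tau
    return c / np.linalg.norm(c)               # and re-normalize

The standardization baseline corresponds to applying the same per-component standardization directly to x rather than to log(x + ε).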

[Figure 5. Performance (ROC) on the Daimler-Chrysler pedestrian dataset [33]. EER (%): ours 94.95, L1 norm 78.26, L2 norm 93.91, L2-hys. norm 94.31, L2-Hel. norm 94.63, standardize 91.31, kernel map [41] 94.89.]

The proposed method outperforms the other normalization methods and is competitive with [41]. It should be noted that, while the method of [41] augments the feature dimensionality from D to 7D, significantly increasing the computation time in classification, our method efficiently operates on the respective feature components at a low computational cost without enlarging the dimensionality.

4.2. Bag-of-words histogram

We next apply the Dirichlet Fisher kernel to the bag-of-word (BoW) feature, which is composed of a countable word histogram. We test the method on the PASCAL-VOC2007 dataset [8] in the following experimental setup.

SIFT local descriptors [29] are densely extracted at spatial grid points in a 4-pixel step at three scales of {16, 24, 32} pixels. To construct 16,384 visual words, we apply k-means clustering to a million (transformed) SIFT descriptors randomly drawn from the training images. An image is partitioned into sub-regions in three levels of spatial pyramid, 1×1, 2×2 and 3×1; the BoW histogram features are computed on the respective sub-regions and then concatenated into the image feature vector, which is finally fed into the linear SVM classifier.

PASCAL-VOC2007 [8]. The dataset contains object images of 20 categories with large variations in appearance and pose as well as complex backgrounds. We follow the standard VOC evaluation protocol.

The performance results are shown in Table 1, with comparisons similar to Sec. 4.1, demonstrating that the proposed method is superior to the other normalization methods. It slightly outperforms the method of the Pólya model [4], even though our method is general enough to accept not only this countable BoW histogram but also other types of histogram features, as in Sec. 4.1. For comparison, we also applied the BoW method using LLC [44], and the proposed method significantly outperforms it.

Table 1. Performance on VOC2007 using BoW.
  Method               mAP (%)
  ours                 60.91
  L1 norm              51.93
  L2 norm              53.50
  L2-hys. norm [36]    59.58
  L2-Hel. norm [1]     59.10
  Pólya [4]            60.58
  standardize          58.16
  kernel map [41]      60.67
  BoW-LLC [44]         57.37

4.3. Fisher kernel in bag-of-feature framework

We finally apply the proposed Dirichlet-derived GMM Fisher kernel (Sec. 3.2) to various image classification tasks: object recognition [8, 11], fine-grained object classification [43], scene categorization [34, 45], event classification [24], aerial image classification [46] and material recognition [38]. The experimental setup for the BoF-based Fisher kernel is the same as in Sec. 4.2, except that a K = 256 GMM is obtained by the EM method instead of visual words.

We analyze the performance of the proposed method from various aspects by using the VOC2007 dataset [8]. The proposed method is compared with the DMM Fisher kernel (18) as well as the other types of transformation of SIFT features. In this case, L2-hysteresis normalization results in the baseline Fisher kernel [36] and L2-Hellinger normalization produces RootSIFT [1]. The comparison results in Table 2 show that the proposed method outperforms the others. The DMM-FK is inferior since it corresponds to only one part (the mean) of the proposed FK, as described in Sec. 3.2. Its extension to a GMM-FK using the transform log(x + ε) improves the performance by incorporating the FK features derived from the variance. Nonetheless, the proposed method is still better due to transforming the SIFT features in accordance with the von Mises-Fisher distribution, which actually reduces to a Gaussian. It should be noted again that the method of [41] augments the SIFT dimensionality to 896 = 128 × 7, increasing the computational cost.

Table 2. Performance on VOC2007 using BoF-based FK.
  Method               mAP (%)
  ours                 63.83
  DMM-FK               57.51
  log(x + ε)           63.48
  L1 norm              59.63
  L2 norm              60.03
  L2-hys. norm [36]    61.40
  L2-Hel. norm [1]     63.04
  kernel map [41]      61.95

We then investigate the effect of ε, which is determined based on the percentile of the distribution in Fig. 2a. Fig. 6 shows the performance for various ε in comparison with the baseline FK of L2-hysteresis normalization [36]. The best performance is obtained at the 25% percentile, though all values of ε produce performance superior to the baseline. By using ε of the 25% percentile, the distribution of the features is transformed as shown in Fig. 2b; the smaller feature values are favorably rounded off to log(ε) (the negative evidence) while the higher feature values are diversely distributed to exploit the discriminative information. Throughout the experiments, we apply the proposed method with ε fixed to the value of the 25% percentile, i.e., ε = 0.001 for SIFT, though fine tuning, such as by cross-validation per dataset, would further improve the performance.

[Figure 6. Performance (mAP, %) for various ε, set by the 10-50% percentiles, on VOC2007, compared with the baseline FK (L2-hys.).]

The proposed method produces 63.83%, which is superior to the VOC2007 winner (59.4%) and compares favorably with the state-of-the-art result (63.5%) of [13], which resorts to time-consuming object detection.

In the following experiments on various datasets, the proposed Dirichlet-derived GMM FK is compared with the other types of normalization as well as the state-of-the-art results reported in the referenced papers.

MIT-Scene [34]. This dataset contains 15,620 images in 67 indoor scene categories with large within-class and small between-class variability. We report the classification accuracy according to the experimental setting in [34].

UIUC-Sports [24]. For image-based event classification, Li and Fei-Fei [24] collected 1,792 images of eight sport categories, each of which contains 137∼250 images with large variations of poses and sizes across diverse categories in cluttered backgrounds. Following the experimental setup used in [24], we report the averaged classification accuracies.

Land-Use [46]. Yang et al. [46] collected a dataset of aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map in 21 land-use categories. Each category contains 100 images of 256 × 256 pixels at a resolution of one foot per pixel, including a variety of spatial patterns. We follow the experimental and evaluation protocol in [46].

Flickr Material [38]. There are 1,000 images of ten material categories and human-labeled binary masks associated with the images that describe the location of the object. We extract local SIFT descriptors on the foreground indicated by the binary mask for material recognition. According to the evaluation protocol in [26], the averaged classification accuracies are reported.

CUB200-2011 [43]. This is a challenging dataset of 200 bird species, including 11,788 images in total, for fine-grained (subordinate) object recognition. We used full uncropped images and evaluated the methods using the provided training/test split.

Caltech-256 [11]. This dataset is composed of 30,607 images in 256 object categories. There are large intra-class variances regarding object locations, sizes and poses in the images, which makes this dataset a challenging benchmark for object recognition. According to the standard experimental setting, we randomly pick 15, 30, 45, and 60 training images per category and (at most) 50 images for test. We report the averaged classification accuracy over three trials.

SUN-397 [45]. This dataset contains roughly 100K images of 397 scene categories, covering as many visual world scenes as possible. It is a challenging dataset since even "good" human workers on AMT achieve a classification performance of 68.5% on average. Following the protocol of [45], we used 50 training and 50 test samples per category and measured the classification accuracies averaged over the given 10 training/test partitions.

The performance results on these datasets are shown in Table 3. The proposed method is favorably competitive with the other feature transforms, outperforming the other methods; compared with the baseline of L2-hysteresis normalization, the performance is improved by 2∼3%. The proposed method thus improves the performance of the Fisher kernel by effectively transforming the local descriptors at a low extra computational cost.

5. Conclusion

We have proposed methods to transform histogram-based features for improving classification performance. The (L1-normalized) histogram features, regarded as probability mass functions, are modeled by the Dirichlet distribution from the probabilistic perspective. Based on this probabilistic model, we induce the Dirichlet Fisher kernel, which transforms individual histogram features, and, in the BoF framework, the Dirichlet mixture model is extended to a GMM via the feature transform, which leads to the Dirichlet-derived GMM Fisher kernel. The proposed methods do not require cumbersome parameter tuning; the parameter is generally determined based on the statistics of the features, e.g., natural SIFT statistics. In experiments on a variety of image classification tasks, the proposed methods favorably improve the performance of the histogram-based features and the BoF-based Fisher kernel.

Table 3. Performance comparison on various datasets.

(a) MIT-Scene [34] (Acc., %): Ours 63.4; L2-hys. norm [36] 60.3; L2-Hel. norm [1] 60.4; kernel map [41] 61.6; Bo et al. [2] 41.8; Zheng et al. [48] 47.2; Bo et al. [3] 50.5; Juneja et al. [18] 63.1.

(b) UIUC-Sports [24] (Acc., %): Ours 92.6±0.7; L2-hys. norm [36] 91.3±1.3; L2-Hel. norm [1] 92.2±1.0; kernel map [41] 92.3±1.4; Liu et al. [28] 84.6±1.5; Gao et al. [9] 85.3±0.5; Bo et al. [2] 85.7±1.3; Zheng et al. [48] 87.2.

(c) Land-Use [46] (Acc., %): Ours 92.8±0.9; L2-hys. norm [36] 92.2±0.9; L2-Hel. norm [1] 92.1±1.1; kernel map [41] 92.2±0.6; Yang et al. [46] 77.4; Jiang et al. [17] 77.8.

(d) Flickr Material [38] (Acc., %): Ours 57.3±0.9; L2-hys. norm [36] 55.2±0.9; L2-Hel. norm [1] 56.3±0.8; kernel map [41] 56.8±1.2; Liu et al. [26] 44.6; Li and Fritz [25] 48.1; Liu et al. [27] 48.2; Hu et al. [14] 54.

(e) CUB200-2011 [43] (Acc., %): Ours 27.3; L2-hys. norm [36] 25.8; L2-Hel. norm [1] 26.6; kernel map [41] 26.6; Wah et al. [43] 10.3; Wah et al. [42] 17.3.

(f) Caltech-256 [11] (Acc., %, for 15/30/45/60 training samples): Ours 41.8±0.2 / 49.8±0.1 / 54.4±0.3 / 57.4±0.4; L2-hys. norm [36] 39.8±0.3 / 48.0±0.3 / 52.4±0.3 / 55.4±0.2; L2-Hel. norm [1] 41.2±0.3 / 49.5±0.2 / 53.9±0.4 / 56.8±0.3; kernel map [41] 40.3±0.1 / 48.5±0.2 / 52.9±0.3 / 55.9±0.4; Gehler et al. [10] 34.2 / 45.8 / - / -; Kulkarni and Li [21] 39.4 / 45.8 / 49.3 / 51.4; Bo et al. [3] 40.5±0.4 / 48.0±0.2 / 51.9±0.2 / 55.2±0.3.

(g) SUN-397 [45] (Acc., %): Ours 46.1±0.1; L2-hys. norm [36] 43.4±0.3; L2-Hel. norm [1] 45.2±0.2; kernel map [41] 44.3±0.2; Shen et al. [39] 23.1; Xiao et al. [45] 38.0.

References

[1] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
[2] L. Bo, X. Ren, and D. Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In NIPS, 2011.
[3] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In NIPS Workshop on Deep Learning, 2012.
[4] R. G. Cinbis, J. Verbeek, and C. Schmid. Image categorization using Fisher kernels of non-iid image models. In CVPR, 2012.
[5] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[7] C. Elkan. Deriving tf-idf as a Fisher kernel. In SPIRE, 2005.
[8] M. Everingham, L. V. Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 Results.
[9] S. Gao, I.-H. Tsang, L.-T. Chia, and P. Zhao. Local features are not lonely: Laplacian sparse coding for image classification. In CVPR, 2010.
[10] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
[11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, Caltech, 2007.
[12] O. C. Hamsici and A. M. Martinez. Spherical-homoscedastic distributions: The equivalency of spherical and normal distributions in classification. Journal of Machine Learning Research, 8:1583-1623, 2007.
[13] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient object localization and image classification. In ICCV, 2009.
[14] D. Hu, L. Bo, and X. Ren. Toward robust material recognition for everyday objects. In BMVC, 2011.
[15] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS, 1999.
[16] H. Jégou and O. Chum. Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening. In ECCV, 2012.
[17] Y. Jiang, J. Yuan, and G. Yu. Randomized spatial partition for scene recognition. In ECCV, 2012.
[18] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that shout: Distinctive parts for scene classification. In CVPR, 2013.
[19] T. Kobayashi and N. Otsu. Image feature extraction using gradient local auto-correlations. In ECCV, 2008.
[20] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS, 2009.
[21] N. Kulkarni and B. Li. Discriminative affine sparse codes for image classification. In CVPR, 2011.
[22] K. V. Mardia and P. Jupp. Directional Statistics (2nd edition). John Wiley and Sons Ltd., 2000.
[23] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[24] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.
[25] W. Li and M. Fritz. Recognizing materials from virtual examples. In ECCV, 2012.
[26] C. Liu, L. Sharan, E. Adelson, and R. Rosenholtz. Exploring features in a Bayesian framework for material recognition. In CVPR, 2010.
[27] L. Liu, P. Fieguth, G. Kuang, and H. Zha. Sorted random projections for robust texture classification. In ICCV, 2011.
[28] L. Liu, L. Wang, and X. Liu. In defense of soft-assignment coding. In ICCV, 2011.
[29] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[30] R. E. Madsen, D. Kauchak, and C. Elkan. Modeling word burstiness using the Dirichlet distribution. In ICML, 2005.
[31] S. Maji and A. C. Berg. Max-margin additive classifiers for detection. In ICCV, 2009.
[32] T. Minka. Estimating a Dirichlet distribution. Technical report, 2000.
[33] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1863-1868, 2006.
[34] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.
[35] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503-520, 2004.
[36] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.
[37] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.
[38] L. Sharan, R. Rosenholtz, and E. Adelson. Material perception: What can you see in a brief glance? Journal of Vision, 9(8):784, 2009.
[39] L. Shen, S. Wang, G. Sun, S. Jiang, and Q. Huang. Multi-level discriminative dictionary learning towards hierarchical visual categorization. In CVPR, 2013.
[40] J. C. van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders. Kernel codebooks for scene categorization. In ECCV, 2008.
[41] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, 2010.
[42] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In ICCV, 2011.
[43] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[44] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
[45] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[46] Y. Yang and S. Newsam. Spatial pyramid co-occurrence for image classification. In ICCV, 2011.
[47] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213-238, 2007.
[48] Y. Zheng, Y.-G. Jiang, and X. Xue. Learning hybrid part filters for scene recognition. In ECCV, 2012.

