2, FEBRUARY 2013
I. INTRODUCTION
among which the Bag of Words (BoW) based ones [1], [2],
[5] present outstanding simplicity and effectiveness.
The BoW image representation is typically generated via the following three steps: 1) extract local features of an image at the interest points; 2) generate a dictionary/codebook and then quantize/encode the local features into codes accordingly; and 3) pool all the codes together to generate the global image representation. Such a process can be summarized as a feature extraction-coding-pooling pipeline. It has been widely used in recent image classification methods and achieves impressive performance [1], [2], [7].
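As a toy sketch of this three-step pipeline (not the authors' implementation; the descriptor dimensions, codebook size, and function names here are made up for illustration), vector quantization coding followed by average pooling can be written as:

```python
import numpy as np

def bow_representation(local_features, codebook):
    # Step 2: vector-quantize each local feature to its nearest visual word.
    d2 = ((local_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = np.eye(codebook.shape[0])[d2.argmin(axis=1)]  # one-hot codes, (N, M)
    # Step 3: pool all codes into a single global histogram.
    return codes.mean(axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))    # step 1 stand-in: 50 local descriptors
codebook = rng.normal(size=(4, 8))  # M = 4 visual words
hist = bow_representation(feats, codebook)
```

The pooled output is a normalized histogram over the visual words, which serves as the global image representation.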
Within the above framework, the coding process will
inevitably introduce information loss due to the feature quantization. Such undesirable information loss severely damages
the discriminative power of the generated image representation and thus decreases the image classification performance.
Therefore, various coding methods have been proposed to encode local features more accurately with less information loss. Most of these methods are developed from Vector Quantization (VQ), which conducts hard assignment in the coding process [5]. Despite its great simplicity, its inherently large coding error¹ often leads to unrecoverable loss of discriminative information and severely limits the classification performance [8]. Several refined coding schemes alleviate this issue. For example, soft-assignment [6], [9], [10] estimates memberships of each local feature to multiple visual words instead of a single one. Another modified method is Super Vector (SV) coding [11], which additionally incorporates the difference between a local feature and its selected visual word. Thus SV captures higher-order information and shows improved performance.
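The contrast between VQ's hard assignment and soft assignment can be sketched as follows (a minimal illustration with made-up word counts and a Gaussian kernel parameter `beta`; not the exact formulation of [6], [9], [10]):

```python
import numpy as np

def hard_assign(x, words):
    """VQ-style hard assignment: all mass on the single nearest visual word."""
    d2 = ((words - x) ** 2).sum(axis=1)
    code = np.zeros(len(words))
    code[d2.argmin()] = 1.0
    return code

def soft_assign(x, words, beta=1.0):
    """Soft assignment: graded memberships to all words via a Gaussian kernel."""
    d2 = ((words - x) ** 2).sum(axis=1)
    w = np.exp(-beta * d2)
    return w / w.sum()

rng = np.random.default_rng(1)
words = rng.normal(size=(5, 4))   # 5 visual words, dim 4
x = rng.normal(size=4)            # one local feature
h, s = hard_assign(x, words), soft_assign(x, words)
```

Both codes sum to one, but the soft code spreads mass over all words, retaining information about how close the feature is to each of them.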
Though many coding methods [1], [2], [10], [11] have been proposed to accurately represent the input features, the information loss in the feature quantization for coding is still inevitable. In fact, Boiman et al. [8] have pointed out that the
local features from a long-tail distribution are inherently inappropriate for quantization, and the information lost in feature quantization is quite important for good image classification performance. To tackle this issue, the Naive Bayes Nearest Neighbor (NBNN) method was proposed to avoid the feature coding process by employing the image-to-class distance
for image classification [8]. Benefiting from alleviating the information loss, NBNN achieves classification performance competitive with coding-based methods on multiple datasets. Motivated by its success, several methods [12]–[14] have been developed to further improve NBNN. However, all variants of NBNN practically employ uniform summation to aggregate image-to-class distances calculated based on local
¹Also called the coding residual, which refers to the difference between the original local feature and the feature reconstructed from the produced codes.
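The NBNN image-to-class distance can be sketched as below (a minimal two-class toy, with synthetic descriptors and dimensions chosen only for illustration):

```python
import numpy as np

def nbnn_classify(query_feats, class_feats):
    """NBNN: for each class, sum the squared distance from every query
    descriptor to its nearest neighbor among that class's descriptors,
    then pick the class with the smallest image-to-class distance."""
    scores = {}
    for c, feats in class_feats.items():
        d2 = ((query_feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=2)
        scores[c] = d2.min(axis=1).sum()   # image-to-class distance
    return min(scores, key=scores.get)

rng = np.random.default_rng(2)
class_feats = {0: rng.normal(0.0, 1.0, (30, 8)),   # descriptors of class 0
               1: rng.normal(3.0, 1.0, (30, 8))}   # descriptors of class 1
query = rng.normal(3.0, 1.0, (10, 8))              # image drawn near class 1
pred = nbnn_classify(query, class_feats)
```

No codebook or quantization is involved: each local feature is compared directly against the pooled descriptor set of every class.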
[Fig. 1 diagram: images → local features → linear coding → max-pooling (SPM); in parallel, distance transformation against the class manifolds (Class 1, Class 2, …, Class K) → coding & pooling → image representation]
Fig. 1. Illustration of linear distance coding. The local features extracted from various classes of training images are first used to generate a manifold for each class, represented by a set of local features (i.e., anchor points). Based on the obtained class manifolds, the local feature $x_i$ is transformed into a more discriminative distance vector $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$, where $K$ denotes the number of classes. On these transformed distance vectors, linear coding and max-pooling are performed to produce the final image representation. The principle of the distance transformation from the original local feature $x_i$ to the distance feature $d_i$ is to form a class-manifold coordinate system with the $K$ obtained class manifolds, where each class corresponds to one axis. For the $k$th class manifold $M^k$, the coordinate value $d_{i,k}$ of local feature $x_i$ corresponds to the distance between $x_i$ and this class manifold. Image best viewed in color.
$Q(x) = \sum_{j=1}^{L} \exp\!\left(-\frac{\|x - x_j\|^2}{2\sigma^2}\right)$

$\hat{c} = \arg\min_{c} \sum_{i=1}^{N} \left\| x_i - \mathrm{NN}_c(x_i) \right\|^2 \qquad (4)$
$d_{i,c} = \min_{v_i} \left\| x_i - M^c v_i \right\|_2^2 \qquad (5)$

subject to: $\mathbf{1}^T v_i = 1, \quad v_{i,j} = 0 \ \text{if} \ m_j^c \notin N_k^i \qquad (6)$
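The constrained reconstruction in (5)–(6) can be sketched as follows. This is our own minimal reading, with an LLE-style closed-form solve and a hypothetical regularization parameter `reg` that is not part of the authors' formulation:

```python
import numpy as np

def class_manifold_distance(x, anchors, k=5, reg=1e-4):
    # Pick the k nearest anchor points of this class (the support set N_k^i).
    d2 = ((anchors - x) ** 2).sum(axis=1)
    nn = anchors[np.argsort(d2)[:k]]          # (k, D)
    # Solve min_v ||x - v @ nn||^2 subject to 1^T v = 1 (closed form).
    Z = nn - x
    G = Z @ Z.T + reg * np.eye(k)             # regularized local Gram matrix
    v = np.linalg.solve(G, np.ones(k))
    v /= v.sum()                              # enforce the sum-to-one constraint
    # The coordinate d_{i,c} is the squared reconstruction residual.
    return float(((x - v @ nn) ** 2).sum())

rng = np.random.default_rng(3)
anchors = rng.normal(size=(40, 6))            # anchor points of one class
d_on = class_manifold_distance(anchors[0], anchors)        # point on the manifold
d_off = class_manifold_distance(anchors[0] + 5.0, anchors) # point far away
```

A feature lying on the class manifold yields a near-zero residual, while a feature far from it yields a large one, which is exactly the discriminative signal the distance vector collects across all classes.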
Let $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$ denote the distance vector of the local feature $x_i$, which describes its relationship to all $K$ classes. In contrast to the original local
features (e.g., SIFT), which describe the appearance patterns of characteristic objects, the distance vector represents a relative pattern that captures the discriminative part of local features w.r.t. the specified classes, i.e., it is more class-specific, as desired. In fact, the distance vector is the projection residue of local features onto the class manifolds, as shown in Figure 1. Note that in the figure each axis denotes one class manifold. Through such residue-pursuit feature transformation, the distance vector gains the following advantages over the original local features:
1) The distance vector preserves the discriminative information of local features lost in the traditional feature
coding process.
2) The distance vector coordinates better with additional operations that exploit spatial information, e.g., SPM. The spatial pooling of traditional local features requires that the involved images have a similar object layout so that the resulting representations of different images can be well matched element-wise. This overly strict requirement is significantly relieved by the distance vector because of the class-specific characteristic of the adopted image-to-class distance, as shown in Figure 2.
Compared with previous NBNN methods, which directly sum up the image-to-class distances for classification, here we propose to use the distance vector as a new kind of local feature. Thus, any classification model used on the original local features can be directly applied to the distance vectors.
Before introducing the more robust and discriminative distance pattern, we first recall the original NBNN strategy for image classification. Given an image I with N local features $x_i$, the distance vectors $d_i \in \mathbb{R}^K$ are calculated as in (5). Then the estimated category $\hat{c}$ of I is determined by the following criterion:
$\hat{c} = \arg\min_k \left[ \sum_{i=1}^{N} d_i \right]_k, \quad \text{where } \sum_{i=1}^{N} d_i = \left[ \sum_{i=1}^{N} d_{i,1}, \; \sum_{i=1}^{N} d_{i,2}, \; \ldots, \; \sum_{i=1}^{N} d_{i,K} \right]^T \qquad (7)$
where $k$ is the index of the element corresponding to the category. Namely, the original NBNN method considers only the element-wise semantics of the obtained distance vector separately, and completely ignores the intrinsic pattern described by the distance vector as a whole.
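The uniform-summation decision of Eq. (7) amounts to a few lines (a toy example with made-up distance values, shown only to make the element-wise nature of the rule concrete):

```python
import numpy as np

def nbnn_decision(distance_vectors):
    """Eq. (7): uniformly sum the distance vectors d_i over the image and
    return the class index with the smallest accumulated distance."""
    total = np.asarray(distance_vectors).sum(axis=0)  # (K,)
    return int(total.argmin())

D = np.array([[0.2, 0.9, 0.8],    # N = 3 distance vectors, K = 3 classes
              [0.4, 0.7, 0.9],
              [0.1, 0.8, 0.6]])
c_hat = nbnn_decision(D)          # column sums: [0.7, 2.4, 2.3]
```

Each column is summed independently, so any joint pattern across the K components of a single $d_i$ is discarded — the limitation our coding-based treatment addresses.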
Different from the previous methods, we regard each distance vector as an integral feature, and then apply a high-performing coding model to these transformed features. In particular, the distance pattern finally used in our method
[Fig. 2 diagram: feature space vs. image-level representation space; distance features of examples from Class 1 and Class 2 in Image 1 and Image 2]
$\tilde{d}_i = d_i - \min(d_i), \quad \hat{d}_i = f_n(d_i) = \frac{1}{\|\tilde{d}_i\|_2}\left[\tilde{d}_{i,1}, \tilde{d}_{i,2}, \ldots, \tilde{d}_{i,K}\right]^T \qquad (8)$
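Under our reading of Eq. (8), the normalization shifts the vector so its best-matching class sits at zero and then scales it to unit L2 norm; a minimal sketch:

```python
import numpy as np

def normalize_distance_vector(d):
    """Eq. (8): subtract the minimum component, then scale to unit L2 norm."""
    d = np.asarray(d, dtype=float)
    d = d - d.min()                 # best-matching class maps to 0
    return d / np.linalg.norm(d)    # unit L2 norm (assumes d is not constant)

d_hat = normalize_distance_vector([3.0, 1.0, 5.0])  # shifted to [2, 0, 4]
```

The shift makes the pattern invariant to the absolute distance scale of an image, keeping only the relative preferences among classes.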
[Fig. 3 diagram: (a) local feature extraction → linear coding → spatial max-pooling; (b) distance transformation → LDC → spatial max-pooling; the concatenated representation feeds a linear SVM]
Fig. 3. Overview of the image classification flowchart. This architecture
has been proven to achieve state-of-the-art performance on the basis of a
single type of feature, e.g., LLC [2]. (a) Linear coding and max-pooling
are sequentially performed on original extracted local features, resulting
in an original image representation. (b) All local features are transformed
into distance vectors, on which the linear coding and max-pooling are
sequentially performed. This coding process is called LDC in this paper,
and it results in a distance image representation. Finally, the original image representation and the distance image representation are simply concatenated so that they complement each other, and a linear SVM is adopted for the final classification.
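The fusion step of Fig. 3 can be sketched as follows (the code matrices and dictionary sizes are invented for illustration; only the max-pool-then-concatenate structure comes from the figure):

```python
import numpy as np

def combined_representation(codes_sift, codes_dist):
    """Fig. 3: max-pool the coded SIFT features (branch a) and the coded
    distance vectors (branch b) separately, then concatenate the pooled
    vectors; the result is what the linear SVM consumes."""
    return np.concatenate([codes_sift.max(axis=0), codes_dist.max(axis=0)])

rng = np.random.default_rng(4)
rep = combined_representation(rng.random((100, 16)),   # 100 coded SIFT features
                              rng.random((100, 8)))    # 100 coded distance vectors
```

Because both branches end in max-pooling, the concatenated vector stays fixed-length regardless of how many local features an image contains.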
$\min_{y_i} \left\| \hat{d}_i - B y_i \right\|_2^2 \qquad (9)$

subject to: $\mathbf{1}^T y_i = 1 \qquad (10)$

where $B$ denotes the codebook learned on the distance vectors.
V. E XPERIMENTS
In this section, we evaluate the performance of the proposed
method on three groups of benchmark datasets: specific objects
(e.g., flower, food), scene and general objects. In particular, the
specific object datasets include Flower 102 [20] and PFID
61 [21], in which the images are relatively clean without
cluttered background. The scene datasets include Scene 15 [7]
and Indoor 67 [22], and the general object datasets include
Caltech 101 [23] and Caltech 256 [24].
Among various feature coding models producing relatively
compact image representations, Locality-constrained Linear
Coding (LLC) and Localized Soft-Assignment Coding (LSA)
Fig. 5. Example images of Flower 102 dataset, where each row represents one
category. (a) Original images. (b) Corresponding segmented images. Limited
by the performance of the segmentation algorithm, the segmented images may
contain part of the background, lose part of the object, or even lose the whole
object. Image best viewed in color.
TABLE I
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO OBJECT DATASETS Flower 102 AND PFID 61

Methods         Flower 102   PFID 61
—               55.10        –
—               55.20        –
—               –            9.20
OM [28]c        –            28.20
LLC-SIFT        57.75        44.63 ± 4.00
LLC-distance    59.76        48.45 ± 3.58
LLC-combine     61.45        48.27 ± 3.59
LSA-SIFT        57.80        43.35 ± 3.36
LSA-distance    58.78        46.90 ± 3.47
LSA-combine     60.38        46.54 ± 3.08

… of PFID 61.
c The Orientation and Midpoint (OM) method, one of a set of methods based on the statistics of pairwise local features proposed by Yang et al., yields the best accuracy among them, where the χ² kernel is adopted with SVM.

Fig. 6. Example images of PFID 61 dataset, where each row of the left and right part represents one category. Each category contains three instances and each instance has six images from different views. Two images of each instance are shown here. Image best viewed in color.
TABLE II
CLASSIFICATION ACCURACY (%) COMPARISON ON TWO SCENE DATASETS Scene 15 AND Indoor 67

Methods         Scene 15        Indoor 67
—a              –               26.50
—               80.90           37.60
KSPM [7]        81.40 ± 0.50    –
ScSPM [1]       80.28 ± 0.93    –
—               84.10 ± 0.50    77.00
LLC-SIFT        79.81 ± 0.35    43.78
LLC-distance    80.30 ± 0.62    43.53
LLC-combine     82.40 ± 0.35    46.28
LSA-SIFT        80.12 ± 0.60    44.19
LSA-distance    79.73 ± 0.70    42.04
LSA-combine     82.50 ± 0.47    46.69
NBNN [13]d      –               –

a The baseline result provided by the authors of Indoor 67, where the
Fig. 9. Example images of Caltech 101 and Caltech 256 datasets containing 102 and 257 categories, respectively. Besides object categories, each of the two datasets contains one extra background category, namely, BACKGROUND_Google for Caltech 101 and clutter for Caltech 256. All categories in the two datasets have large object variations with cluttered backgrounds. Compared with Caltech 101, Caltech 256 has a more irregular object layout, which may degrade the classification performance due to imperfect matching in spatial pooling. Image best viewed in color.
TABLE III
CLASSIFICATION ACCURACY (%) COMPARISON ON Caltech 101 AND Caltech 256

Methods         Caltech 101     Caltech 256
SVM-KNN [32]    66.20 ± 0.50    –
—               64.60 ± 0.80    34.10
ScSPM [1]       73.20 ± 0.54    34.02 ± 0.35
—               71.50 ± 1.10    35.74 ± 0.10
—               70.40           37.00
LLC [2]c        73.44           41.19
LSA [10]        74.21 ± 0.81    –
LLC-SIFT        72.65 ± 0.33    36.27 ± 0.27
LLC-distance    73.34 ± 0.95    37.40 ± 0.07
LLC-combine     74.59 ± 0.54    38.41 ± 0.11
LSA-SIFT        72.86 ± 0.33    36.52 ± 0.26
LScSPM [16]     –               –
LSA-distance    71.45 ± 0.87    36.30 ± 0.06
LSA-combine     74.47 ± 0.46    38.25 ± 0.08

a For fair comparison, the result of basic features with a linear kernel is shown here. Higher accuracy is also reported in [31], but the intersection kernel is employed there.
b Performance of the original NBNN [8] provided in [2].
c LLC adopts three-scale SIFT features and a global dictionary of size 4096, which can yield higher accuracy than single-scale features, especially for Caltech 256 with larger scale variation.
[Bar chart: classification accuracy (35%–80%) under different values of the neighborhood parameter $k_{nn}^d$]
1) SIFT: For the three selected datasets, the optimal parameter is quite different, e.g., $k_{nn}^c = 2$ for Flower 102, while $k_{nn}^c = 5$ for Indoor 67 and Caltech 101. This may be caused by the dependence of the optimal parameter value on the interference of cluttered background. In particular, the images in Flower 102 are all segmented, which significantly reduces the influence of background, so a small neighborhood is sufficient.
2) Distance: The distance vector possesses a different semantic from the original local feature, introduced by our proposed transformation. Compared with SIFT, the per-
[32] H. Zhang, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, Jun. 2006, pp. 2126–2136.
[33] O. Duchenne, A. Joulin, and J. Ponce, "A graph-matching kernel for object categorization," in Proc. Int. Conf. Comput. Vis., Barcelona, Spain, Nov. 2011, pp. 1792–1799.