

2, FEBRUARY 2013


Linear Distance Coding for Image Classification

Zilei Wang, Jiashi Feng, Shuicheng Yan, Senior Member, IEEE, and Hongsheng Xi

Abstract: The feature coding-pooling framework is shown
to perform well in image classification tasks, because it can
generate discriminative and robust image representations. The
unavoidable information loss incurred by feature quantization
in the coding process and the undesired dependence of pooling
on the image spatial layout, however, may severely limit the
classification. In this paper, we propose a linear distance coding
(LDC) method to capture the discriminative information lost in
traditional coding methods while simultaneously alleviating the
dependence of pooling on the image spatial layout. The core of
the LDC lies in transforming local features of an image into
more discriminative distance vectors, where the robust image-to-class
distance is employed. These distance vectors are further
encoded into sparse codes to capture the salient features of the
image. The LDC is theoretically and experimentally shown to
be complementary to the traditional coding methods, and thus
their combination can achieve higher classification accuracy. We
demonstrate the effectiveness of LDC on six data sets, two of
each of three types (specific object, scene, and general object),
i.e., Flower 102 and PFID 61, Scene 15 and Indoor 67, Caltech 101
and Caltech 256. The results show that our method generally
outperforms the traditional coding methods, and achieves or is
comparable to the state-of-the-art performance on these data sets.
Index Terms: Image classification, image-to-class distance,
linear distance coding (LDC).


GENERATING compact, discriminative, and robust image
representations is undoubtedly critical to image classification
[1], [2]. Recently, several local features, e.g., SIFT [3]
and HOG [4], have become popular for representing images due
to their ability to capture distinctive details of the images.
However, local features are rarely fed directly into image
classifiers because of the computational complexity and their
sensitivity to noise. A common strategy is to integrate the
local features into a global image representation first. To this
end, various methods [1], [2], [5], [6] have been proposed,
among which the Bag of Words (BoW) based ones [1], [2],
[5] present outstanding simplicity and effectiveness.

Manuscript received February 16, 2012; revised August 30, 2012; accepted
August 30, 2012. Date of publication September 13, 2012; date of current
version January 10, 2013. This work was supported in part by the National
Natural Science Foundation of China under Grant 61203256 and the Singapore
Ministry of Education under Grant MOE2010-T2-1-087. The associate editor
coordinating the review of this manuscript and approving it for publication was
Prof. Erhardt Barth.
Z. Wang is with the Department of Automation, University of Science
and Technology of China (USTC), Hefei 230027, China, and also with the
Department of Electrical and Computer Engineering, National University of
Singapore, 117576 Singapore.
J. Feng and S. Yan are with the Department of Electrical and Computer
Engineering, National University of Singapore, 117576 Singapore.
H. Xi is with the School of Information Science and Technology, University
of Science and Technology of China, Hefei 230027, China.
Color versions of one or more of the figures in this paper are available
online.
Digital Object Identifier 10.1109/TIP.2012.2218826

BoW image representation is typically generated via the following
three steps: 1) extract local features of an image at
the interest points; 2) generate a dictionary/codebook and then
quantize/encode the local features into codes accordingly; and
3) pool all the codes together to generate the global image
representation. Such a process can be summarized as a feature
extraction-coding-pooling pipeline. It has been widely
used in recent image classification methods and achieves
impressive performance [1], [2], [7].
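
The coding and pooling steps above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the cited works: it assumes hard-assignment (VQ) coding and average pooling, and the function names, the toy codebook, and the toy features are all hypothetical.

```python
# Minimal sketch of the coding and pooling steps of a BoW pipeline.
# Assumptions: features and visual words are plain lists of floats,
# coding is hard assignment (VQ), pooling is a simple average.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def vq_encode(feature, codebook):
    """Hard assignment: a one-hot code marking the nearest visual word."""
    nearest = min(range(len(codebook)),
                  key=lambda k: squared_distance(feature, codebook[k]))
    return [1.0 if k == nearest else 0.0 for k in range(len(codebook))]

def average_pool(codes):
    """Average the codes of all features into one normalized histogram."""
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]

# Toy data (hypothetical): three visual words and four local features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, 0.2], [0.9, 1.2], [1.1, 0.8], [3.7, 0.1]]

codes = [vq_encode(f, codebook) for f in features]
histogram = average_pool(codes)
```

Here the second and third features both collapse onto the same visual word, so their difference no longer appears in `histogram`; this is the quantization loss discussed next.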
Within the above framework, the coding process
inevitably introduces information loss due to feature quantization.
Such information loss severely damages
the discriminative power of the generated image representation
and thus decreases the image classification performance.
Therefore, various coding methods have been proposed to
encode local features more accurately, with less information loss.
Most of these methods are developed from Vector Quantization
(VQ), which conducts hard assignment in the coding
process [5]. Despite its simplicity, the inherently large
coding error¹ of VQ often leads to unrecoverable loss of
discriminative information and severely limits the classification
performance [8]. For example, soft-assignment [6], [9],
[10] alleviates this issue by estimating memberships of each local
feature to multiple visual words instead of a single one. Another modified
method is Super Vector (SV) coding [11], which additionally
incorporates the difference between the local feature and the selected
visual word. Thus SV captures higher-order information
and shows improved performance.
Though many coding methods [1], [2], [10], [11] have been
proposed to accurately represent the input features, the
information loss in the feature quantization for coding is still
inevitable. In fact, Boiman et al. [8] have pointed out that
local features from a long-tail distribution are inherently
inappropriate for quantization, and that the information lost in feature
quantization is quite important for good image classification
performance. To tackle this issue, the Naive Bayes Nearest
Neighbor (NBNN) method was proposed to avoid the feature
coding process by employing the image-to-class distance
for image classification [8]. By alleviating the
information loss, NBNN achieves classification performance
competitive with coding-based methods on multiple datasets.
Motivated by its success, several methods [12]-[14]
have been developed to further improve NBNN. However, all
variants of NBNN in practice employ uniform summation to
aggregate the image-to-class distances calculated from local
features. This introduces two inherent drawbacks: they
are sensitive to noisy features and are easily dominated by
outlier features.

¹ Also called the coding residual, which refers to the difference between
the original local feature and the feature reconstructed from the produced codes.

1057-7149/$31.00 © 2012 IEEE

In essence, the BoW-based methods and the NBNN-based
methods use different visual characteristic statistics to
perform image classification. The former depend on salient
features of an image, while the latter treat all the local
features equally. In addition, the NBNN methods replace the image-level
similarities with the image-to-class distance when performing
classification in order to generate more robust results. Therefore,
the BoW and NBNN based methods may be suitable
for different types of images. For example, for images
with cluttered background, the BoW based methods show better
classification performance due to their ability to capture the
salient features. It is thus reasonable to expect that
if we can combine the advantages of both, namely
capturing the saliency of images without information loss, the
classification performance can be improved further.
Besides reducing the information loss of feature coding,
exploring spatial context more effectively is also crucial
for achieving good classification performance. In most of
the coding-pooling based methods, Spatial Pyramid Matching
(SPM) [7] has been widely adopted in the pooling procedure
due to its effectiveness and simplicity. However, SPM strictly
requires the involved images to present similar spatial layouts
to ensure that the generated image representations match
well in an element-wise manner [15]. This requirement originates
from the fact that the local features used often represent
object-specific visual patterns. Such a requirement
has a negative effect on classification accuracy because realistic
images usually show various spatial layouts even within the
same category. Alternatively, if the elements of the adopted
features can be transformed to bear class-specific semantics,
this requirement would be greatly relieved.
In this paper, we propose a novel Linear Distance Coding
(LDC) method that inherits the desirable properties
of BoW and NBNN while relieving the image spatial
alignment requirement of SPM. LDC also works under the
feature extraction-coding-pooling framework, i.e., it generates
the image representations from the salient characteristic local
features for the classification, as shown in Figure 1. The
proposed LDC particularly focuses on utilizing the discriminative
information lost by the traditional coding methods and on
exploiting the spatial information more effectively. In practice,
LDC transforms each local feature into a distance vector,
which is an alternative discriminative pattern of the local feature,
in the class-manifold coordinate system. Compared with the
original local features, each element of the distance vector
carries a class-specific semantic, namely the
distance of the local feature to a class-specific manifold. Thus
the strict requirement of image layout similarity in the original
SPM can be effectively relieved, since the class
semantic embedded in each feature element makes the similarity
calculation between differently posed objects more robust, as detailed
later. Comprehensive experiments on various types of datasets
consistently show that the image representation produced by
LDC achieves better or competitive performance compared
with state-of-the-art methods. Furthermore, the image representations
produced by LDC are proven to be complementary to the ones
from the original coding methods. Thus their combination,
even a direct concatenation of the resulting image representations,
can yield a remarkable performance improvement, as expected.

[Figure 1. Panels: Local features; Class Manifolds (Class 1, Class 2, ..., Class K); Distance to Class Manifold; Manifold Coordinate System.]

Fig. 1. Illustration of linear distance coding. The local features extracted from
various classes of training images are first used to generate a manifold for each
class that is represented by a set of local features (i.e., anchor points). Based
on the obtained class manifolds, the local feature $x_i$ is transformed into a more
discriminative distance vector $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$, where $K$ denotes
the class number. On these transformed distance vectors, the linear coding
and max-pooling are performed to produce the final image representation.
The principle of the distance transformation from original local feature $x_i$
to distance feature $d_i$ is to form a class-manifold coordinate system with
the $K$ obtained class manifolds, where each class corresponds to one axis.
For the $k$th class manifold $M^k$, the coordinate value $d_{i,k}$ of local feature $x_i$
corresponds to the distance between $x_i$ and this class manifold. Image best
viewed in color.

The main contributions of this work can be summarized as follows:
1) We propose a novel distance pattern of local features
through constructing the class-manifold coordinate system.
The produced distance vectors are quite discriminative and
are able to relieve the strict requirement of SPM
on image spatial layout, benefiting from the adopted
more robust image-to-class distance.
2) We propose a linear distance coding (LDC) method,
which conducts the linear coding and max-pooling on
the transformed distance vectors to aggregate
the salient features of images. Compared with the NBNN
methods, such a process can avoid the undesired case
where the discriminative features are dominated by
outlier or noisy features, especially for images with
cluttered background.
3) Both theoretical analysis and experimental verification
show that the image representations produced by LDC are
complementary to the ones from the traditional coding
methods. Their combination is shown to outperform
each of them individually and to achieve state-of-the-art
performance on various benchmark datasets.
This paper is organized as follows. Section II introduces the
related works, including the linear coding models
and the NBNN methods. Section III proposes the distance
pattern by introducing the class-manifold coordinate system.
Section IV applies the linear coding and max-pooling on the
transformed distance vectors, and discusses the combination of LDC and
the original coding method. The experiments on
three types of datasets are presented in Section V, where
the sensitivity of the classification performance to the key
parameters is also discussed. Finally, Section VI concludes this
paper.

The proposed Linear Distance Coding (LDC) simultaneously utilizes
the linear coding methods and the image-to-class
distance adopted in NBNN [8]. In this section, we briefly
discuss the conventional coding methods and the NBNN
methods.
1) Linear Coding Models: Linear coding approximates
the input feature by a linear combination of the bases in a
given dictionary. Through the coding process, input features
are transformed into more discriminative codes. The popular
linear coding models include Vector Quantization (VQ) [5],
Soft-assignment Coding [6], Sparse Coding (SC) [1],
Locality-constrained Linear Coding (LLC) [2] and their variants [16].
Given a dictionary $B = [b_1, b_2, \ldots, b_p] \in \mathbb{R}^{d \times p}$ consisting
of $p$ basis features with dimensionality $d$, linear coding computes
a reconstruction coefficient vector $v \in \mathbb{R}^p$ to represent
the input feature $x \in \mathbb{R}^d$ by minimizing the following loss
function:

$$L(v) = \|x - Bv\|_2^2 + R(v) \tag{1}$$

where the first term measures the approximation error and
the second one serves as regularization. In fact, existing
coding models mainly differ from each other in imposing
different prior structures on the generated code $v$ via a specific
regularization $R(\cdot)$.
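
To make the role of the regularizer concrete, the following lists how three of the cited models are commonly written as instances of (1). The forms follow the usual statements of VQ, SC and LLC rather than anything derived in this paper; the weight $\lambda$ and the locality adaptor $e$ (whose entries grow with the distance between $x$ and each basis) are notation assumed here for illustration.

```latex
\begin{align*}
\text{VQ [5]:}\quad  & \min_{v}\ \|x - Bv\|_2^2
      \quad \text{s.t. } \|v\|_0 = 1,\ \mathbf{1}^T v = 1,\ v \ge 0, \\
\text{SC [1]:}\quad  & R(v) = \lambda \|v\|_1, \\
\text{LLC [2]:}\quad & R(v) = \lambda \|e \odot v\|_2^2
      \quad \text{s.t. } \mathbf{1}^T v = 1 .
\end{align*}
```

In each case the structure imposed on $v$ (a single active basis, sparsity, or locality) is obtained at the cost of a larger approximation error in the first term of (1).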
In particular, LLC [2] considers that locality is more essential
than sparsity for feature coding. It adopts a locality
adaptor in the regularization $R(\cdot)$ to replace the $\ell_1$-norm
used in SC. The locality regularization takes into account
the underlying manifold structure of local features and thus
ensures good approximation. Inspired by LLC, Liu et al. [10]
propose to inject locality into the soft-assignment coding and
devise the Localized Soft-Assignment (LSA) coding method.
For any local feature $x$, its membership estimation is restricted
to only a certain number of nearest bases in the dictionary. LSA
discards the possibly unreliable interpretations from distant
bases and obtains a more accurate posterior probability
estimation. However, the accuracy of such posterior estimation
(i.e., the coding result) heavily depends on the size of the adopted
dictionary and the underlying distribution of local features,
which determine the performance of image classification.
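
The restriction of soft memberships to the nearest bases can be sketched as follows. This is a minimal illustration of the idea, not the implementation of [10]: it assumes a Gaussian-style weighting with a smoothing parameter `beta`, and the function name, parameter names, and toy dictionary are hypothetical.

```python
import math

def localized_soft_assignment(feature, dictionary, k=2, beta=1.0):
    """Soft memberships restricted to the k nearest bases.

    Bases outside the k nearest neighbors receive a membership of exactly 0.
    The weighting exp(-beta * squared distance) is an assumed kernel choice.
    """
    dists = [sum((f - b) ** 2 for f, b in zip(feature, base))
             for base in dictionary]
    nearest = sorted(range(len(dictionary)), key=lambda j: dists[j])[:k]
    weights = {j: math.exp(-beta * dists[j]) for j in nearest}
    total = sum(weights.values())
    return [weights[j] / total if j in weights else 0.0
            for j in range(len(dictionary))]

# Toy dictionary (hypothetical): two nearby bases and one distant basis.
dictionary = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
code = localized_soft_assignment([0.5, 0.0], dictionary, k=2, beta=1.0)
```

In this toy case the feature is equidistant from the first two bases, so each receives half of the membership, while the distant third basis is discarded.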
Inspecting the feature coding in (1), the information loss
may originate from two aspects. The first one is the
inaccurate linear approximation and the imperfection of the
dictionary $B$. The second one is that the structure enforced
in $R(\cdot)$ can only be achieved by sacrificing some
approximation accuracy. In linear coding models which operate on
the original local features, such information loss is inevitable.
However, the lost information is probably quite important for
accurate image classification [8].
2) NBNN Methods: The Naive Bayes Nearest Neighbor
(NBNN) [8] is essentially a non-parametric classification
method without a training phase, where the classification is
performed based on the summation of Euclidean distances
between local features of the test image and reference classes
(i.e., image-to-class distance) [8], [12]-[14]. By avoiding
feature coding, NBNN effectively reduces the information
loss and thus achieves competitive classification performance
on multiple benchmark datasets.
In the NBNN methods, all local features from the same class
are assumed to be i.i.d. samples from a certain class-specific
distribution, and thus image classification is equivalent to a
maximum likelihood estimation problem [8]:

$$\hat{c} = \arg\max_{c} p(c \mid Q) = \arg\max_{c} \prod_{x \in Q} p(x \mid c) \tag{2}$$

where $c$ denotes the class, and $Q$ denotes all the descriptors
of the query image. In particular, the NBNN estimates the
likelihood probability through a set of Parzen kernel functions
(typically Gaussian kernel function):

$$p(x \mid c) = \frac{1}{L} \sum_{j=1}^{r} \exp\left(-\frac{1}{2\sigma^2} \|x - x_j^c\|^2\right) \tag{3}$$

where $x_j^c$ is the $j$-th nearest neighbor of $x$ in the class $c$, $\sigma$ is
the bandwidth of the kernel function, $L$ is a normalization factor,
and $r$ denotes the number of nearest neighbors. In NBNN,
the case of $r = 1$ is typically used due to its simplicity and
interpretability. In this case, the resulting NBNN criterion
simplifies to:

$$\hat{c} = \arg\min_{c} \sum_{i=1}^{N} \|x_i - x_i^c\|_2^2 \tag{4}$$

where $x_i^c$ is the nearest neighbor of $x_i$ in the class $c$, and $N$ is
the number of local features. The original NBNN method [8]
treats local features and classes equally and independently via
the summation in (4), which causes sensitivity to
noisy features and outliers. Consequently, the classification
performance cannot be greatly improved even though the more robust
image-to-class distance is adopted.
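
The criterion in (4) can be sketched directly. The following is a brute-force illustration for the case $r = 1$, not the implementation of [8] (which relies on approximate nearest-neighbor search); the function names and the toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def image_to_class_distance(features, class_features):
    """Sum, over query features, of the squared distance to the nearest
    training feature of one class (the inner sum of the NBNN criterion)."""
    return sum(min(squared_distance(x, ref) for ref in class_features)
               for x in features)

def nbnn_classify(features, classes):
    """Return the label with the smallest image-to-class distance.

    `classes` maps a label to the list of local features of that class.
    """
    return min(classes,
               key=lambda c: image_to_class_distance(features, classes[c]))

# Toy data (hypothetical): two classes and a query image with three features.
classes = {"A": [[0.0, 0.0], [0.0, 1.0]], "B": [[5.0, 5.0], [6.0, 5.0]]}
query = [[0.2, 0.1], [0.1, 0.9], [5.5, 5.0]]
predicted = nbnn_classify(query, classes)
```

In this toy query the third feature alone contributes almost all of the distance to class A, which hints at how a single outlier can dominate the uniform summation.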
More specifically, the original NBNN algorithm suffers
from the following three drawbacks: 1) the spatial
information [7], which has been shown to
be quite useful for image classification, is not fully exploited;
2) the computational
complexity increases rapidly with the number of local features
and thus the scalability is severely limited. In particular, the
time complexity for one query image with $N$ features is
$O(N N_D \log N_D)$, where $N_D$ is the number of all local features
of the training images [8]; and 3) it treats all classes equally for
any local feature of the test images, and consequently cannot
adapt to the dataset involved or capture the image saliency
well, as discussed above.
To alleviate these issues, various modified methods have
been proposed, such as using a class-specific Mahalanobis
metric instead of the Euclidean distance [13], associating
class-specific parameters with each class [12], and kernelizing
NBNN [14]. These modified NBNN methods [12]-[14] share
two features although they seem to be quite different. First, all
of them use the same strategy to improve the classification
performance, namely enhancing the adaptiveness of the resultant
metrics by learning some key parameters. In fact, such a learning
process is an alternative to training parametric models on the
training samples. Second, the final classification criterion is
always reduced to the summation of a certain distance over all
local features within each image, no matter what distance
metric is adopted. Such a uniform summation usually
renders the generated metric sensitive to noisy points, as
mentioned above. Consequently, NBNN on its own cannot
outperform the feature coding based methods in
image classification tasks.
In this work, we focus on solving the image classification
problem formally stated as follows: given a set of local
features $X_i$ and the class label $y_i$ of the $i$-th image $I_i$, we
want to learn a classifier from local features to image label,
$C: X_i \to y_i$, such that the classification error is minimized
w.r.t. both the training and test images. In particular, we aim
at a method that generates more discriminative image
representations from $X_i$ for better classification performance. Here we
propose a novel coding method which preserves the superior
discriminative capability and robustness of the feature coding
based methods [2], and meanwhile effectively captures the
information lost in the previous coding methods. In the following,
we first introduce the proposed distance pattern, which
is more discriminative and robust.
A. Class-Specific Distance
Using the distance between a local feature and a certain class to
estimate image membership can provide better generalization
capability. Such class-specific distance is fundamental to the
NBNN methods and crucial to achieving outstanding
classification performance [8]. In particular, all of the existing NBNN
methods approximate the class-specific distance by calculating
the distance between the local feature and its corresponding
nearest neighbor retrieved from the reference images [8].
Formally, let $d(x_i, c)$ denote the distance between a local feature
$x_i$ and the class $c$. Here the class $c$ consists of a set of
local features $\{x_j^c\}$, all of which are extracted from the training
images of $c$. Then $d(x_i, c)$ is computed as

$$d(x_i, c) = \min_{x \in \{x_j^c\}} \|x_i - x\|_2^2 = \|x_i - x_i^c\|_2^2 \tag{5}$$

where $x_i^c$ denotes the mapped point of $x_i$ in class $c$ and reduces
to the nearest neighbor of $x_i$ in the NBNN methods. However,
the derived distance in Equation (5) suffers from the following
drawbacks:
1) It is quite sensitive to noisy features in the training
set $\{x_j^c\}$. A local feature is prone to change significantly
even under slight appearance variation, and this causes
ubiquitous noisy features. In the presence of noisy
features or outliers in $\{x_j^c\}$, the estimated distance of
local features in the test image may severely deviate
from the correct one because of the fragile quadratic
criterion. This may lead to a quite unreliable distance
pattern and consequently degrade the performance of the
classification criterion based on such a distance pattern.
2) It is computationally expensive to find the nearest
neighbor for each query feature, as mentioned above. The
computational complexity $O(N N_D \log N_D)$ grows
with the number of local features
in the training set. In practice, many works extract a
huge number of local features, which heavily limits the
efficiency of NBNN based methods. Although there are
some accelerated algorithms [17], [18], the low
efficiency is still a bottleneck of such distance calculation.
To alleviate these issues, we propose a novel algorithm to
calculate the distance $d(x_i, c)$. The essential idea here is to
calculate a more appropriate mapping point $x_i^c$ rather than to
simply find the nearest neighbor as in NBNN. The new $x_i^c$ is
allowed to be a virtual local feature in the class $c$. In particular,
we assume the local features of each class are sampled from
a class-specific manifold $M^c$, which is completely determined
by the available local features of the corresponding class,
$\{m_i^c\}_{i=1}^{n_c}$. Such features are called anchor points [19], which
can be obtained by clustering the local features from
class $c$. Here the manifold of class $c$ is denoted as $M^c =
[m_1^c, m_2^c, \ldots, m_{n_c}^c]$. Then the computational complexity for a
single input image with $N$ features becomes $O(N n_c \log n_c)$
with $n_c \ll N_D$, where $N_D$ is the number of all training
local features. For example, in our following experiment, there
are about 60 000 local features for each class with 2000
features per image and 30 training images. After the clustering
preprocessing, only $n_c = 1024 \ll 60\,000$ anchor points are
used to describe the manifold. In addition to reducing the
complexity, using the cluster centers as anchor points can
effectively reduce the influence of noisy features and thus
produce a more robust description of the manifold. This
holds under the reasonable assumption that the fraction
of outliers is small, so that the resultant centers are mainly
determined by the dominant inlier features.
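
As a rough back-of-the-envelope check of these numbers, the reduction for a single class can be worked out as below. The comparison ignores constant factors and assumes base-2 logarithms, so it is only an order-of-magnitude illustration rather than a measured speed-up.

```latex
\begin{align*}
\text{features per class:}\quad & 2000 \times 30 = 60\,000, \\
\text{fraction kept as anchors:}\quad & \frac{n_c}{60\,000} = \frac{1024}{60\,000} \approx 1.7\%, \\
\text{search cost ratio per query feature:}\quad &
  \frac{60\,000 \,\log_2 60\,000}{1024 \,\log_2 1024}
  \approx \frac{60\,000 \times 15.9}{1024 \times 10} \approx 93 .
\end{align*}
```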
Now we present an efficient algorithm to determine a
good mapping point $x_i^c$, even when relatively few anchor
points are provided. By utilizing the locally linear structure of
the manifold, $x_i^c$ can be calculated through the locally linear
regression method. More specifically, $x_i^c$ is computed as a
linear combination of its neighboring anchors in the manifold
$M^c$. Here we apply an approximate fast solution of LLC [2] to
our problem, which selects only a fixed number of nearest
neighbors and can be formulated as follows:

$$\min_{v_i} \|x_i - M^c v_i\|_2^2 \quad \text{subject to: } v_{i,j} = 0 \ \text{if } m_j^c \notin N_i^k, \quad \mathbf{1}^T v_i = 1 \tag{6}$$

where $v_i = [v_{i,1}, v_{i,2}, \ldots, v_{i,n_c}]^T$ contains the linear representation
coefficients of $x_i$ on the manifold $M^c$, and $N_i^k$ is the set
of $k$ nearest neighbors of $x_i$. Substituting the resultant
$x_i^c = M^c v_i$ derived from (6) into (5) finally yields the distance
$d(x_i, c)$, which is denoted as $d_i^c$. Such class-specific distance
is motivated by capturing the underlying manifold structure of
the local features and is computed in a robust linear regression
manner. Thus it gains stronger discriminative power and more
robustness to noisy and outlier features.
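
A minimal sketch of this computation is given below. It assumes the commonly used closed-form treatment of the sum-to-one constrained least-squares problem (solve a small Gram system, then rescale), with a small ridge term added for numerical stability; the ridge value, the function names, and the toy anchors are assumptions made for illustration, not details taken from this paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[r]] for r, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def manifold_distance(x, anchors, k=2, reg=1e-4):
    """Squared distance from x to its locally linear reconstruction.

    Only the k nearest anchors get non-zero coefficients, and the
    coefficients are rescaled to sum to one. `reg` is an assumed ridge term.
    """
    nearest = sorted(anchors, key=lambda m: squared_distance(x, m))[:k]
    k = len(nearest)
    shifted = [[mt - xt for mt, xt in zip(m, x)] for m in nearest]
    gram = [[sum(p * q for p, q in zip(a, b)) for b in shifted]
            for a in shifted]
    trace = sum(gram[j][j] for j in range(k))
    for j in range(k):
        gram[j][j] += reg * trace + 1e-12  # keep the system well conditioned
    w = solve_linear_system(gram, [1.0] * k)
    total = sum(w)
    v = [wj / total for wj in w]           # enforce the sum-to-one constraint
    mapped = [sum(vj * m[t] for vj, m in zip(v, nearest))
              for t in range(len(x))]
    return squared_distance(x, mapped), mapped

# Toy manifold (hypothetical): three anchors in the plane.
anchors = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
x = [1.0, 1.0]
distance, mapped = manifold_distance(x, anchors, k=2)
nearest_neighbor_distance = min(squared_distance(x, m) for m in anchors)
```

For this toy input the mapped point lies between the two nearest anchors, so `distance` is smaller than `nearest_neighbor_distance`, the value that the nearest-neighbor rule in (5) would return.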
B. What is Good Distance Pattern?
Let $d_i = [d_i^1, d_i^2, \ldots, d_i^K]^T \in \mathbb{R}^K$ denote the
distance vector of the local feature $x_i$, which aggregates its
relationship to all $K$ classes. In contrast to original local
features (e.g., SIFT), which describe the appearance patterns of
a characteristic object, the distance vector represents a relative
pattern that captures the discriminative part of local features
w.r.t. the specified classes, i.e., it is more class-specific, as
desired. In fact, the distance vector is the projection residue of
local features onto the class manifolds, as shown in Figure 1.
Note that in the figure each axis denotes one class manifold.
Through such residue-pursuit feature transformation, the
distance vector gains the following advantages compared with
original local features:
1) The distance vector preserves the discriminative
information of local features lost in the traditional feature
coding process.
2) The distance vector can coordinate better with
additional operations that explore useful spatial
information, e.g., SPM. The spatial pooling of traditional local
features requires the involved images to have similar object
layouts such that the resulting representations of different
images can be matched well element-wise. Such an overly
strict requirement is significantly relieved by the distance
vector because of the class-specific characteristic of the
adopted image-to-class distance, as shown in Figure 2.
Compared with previous NBNN methods, which directly
sum up the image-to-class distances for classification, here we
propose to use the distance vector as a new kind of local
feature. Thus, any classification model used on the original
local features can also be applied to the distance vector.
Before providing a more robust and discriminative distance
pattern, we first recall the original NBNN strategy for image
classification. Given an image $I$ with $N$ local features $x_i$,
the distance vectors $d_i \in \mathbb{R}^K$ are calculated as in (5). Then
the estimated category $\hat{c}$ of $I$ is determined by the following
criterion:

$$\hat{c} = \arg\min_{k} \sum_{i=1}^{N} d_i = \arg\min_{k} \left[ \sum_{i=1}^{N} d_{i,1}, \sum_{i=1}^{N} d_{i,2}, \ldots, \sum_{i=1}^{N} d_{i,K} \right]^T \tag{7}$$

where $k$ is the index of the element corresponding to the category.
Namely, the original NBNN method considers only
the element-wise semantic of the obtained distance vectors
separately, and completely ignores the intrinsic pattern described by the
distance vector.
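
Read this way, the NBNN decision amounts to summing the distance vectors element-wise and taking the index of the smallest total. The sketch below illustrates this on hypothetical toy values; the function name and the numbers are assumptions for illustration only.

```python
def nbnn_from_distance_vectors(distance_vectors):
    """Sum the distance vectors element-wise and return the index of the
    smallest total together with the totals themselves."""
    totals = [sum(column) for column in zip(*distance_vectors)]
    best = min(range(len(totals)), key=lambda k: totals[k])
    return best, totals

# Toy distance vectors (hypothetical) for three local features, K = 3.
# The first two features are closest to class 0; the third is an outlier.
vectors = [[0.2, 1.5, 2.0],
           [0.4, 1.0, 3.0],
           [9.0, 0.5, 2.5]]
predicted, totals = nbnn_from_distance_vectors(vectors)
```

Although two of the three features individually favor class 0, the single large entry of the third vector shifts the decision to class 1, which illustrates how the uniform summation can be dominated by one feature.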
[Figure 2. Panels: Original local features and Distance features of Image 1 and Image 2 (Class 1, Class 2); image-level representation space.]

Fig. 2. Schematic diagram of the distance pattern relieving the requirement of
layout similarity. In the original feature space, each class has multiple clusters
of characteristic features. When the images involved have different layouts,
the resulting image representations may be quite different due to the features
contained by the same SPM grid of different images being different. This
has a negative impact on the usual element-wise matching-based methods to
achieve high classification accuracy. But such an undesired situation can be
significantly resolved by our proposed distance transformation, as all distance
vectors within the same class turn out to be more similar in the distance
feature space, benefitting from the class-specific characteristic of the adopted
image-to-class distance. Consequently, image representations of the same
class become closer to each other in the image level representation space,
even though they show a totally different layout (e.g., the distance image
representations $v_I^d$ of the two images in class 1). Different shapes represent different
classes in certain feature spaces and different color indicates different features
(e.g., the pink rectangles represent the indistinctive features in class 1, lying
close to class 2). Image best viewed in color.

Different from the previous methods, we regard each
distance vector as an integral feature, and then apply a
well-performing coding model to these transformed features.
In particular, the distance pattern finally used in our method
admits the following form:

$$\tilde{d}_i = d_i - \min(d_i), \qquad \hat{d}_i = f_n(\tilde{d}_i) = \frac{[\tilde{d}_{i,1}, \tilde{d}_{i,2}, \ldots, \tilde{d}_{i,K}]^T}{\|\tilde{d}_i\|_2} \tag{8}$$

where $f_n(\cdot)$ is the normalization function with the $\ell_2$-norm. From
Equation (8), the used $\hat{d}_i$ mainly represents the distance
pattern with $\|\hat{d}_i\|_2 = 1$. In practice, compared with the direct
normalization $f_n(d_i)$ without the minimum subtraction, it is
experimentally shown that the normalization in (8) produces
a slightly higher classification accuracy [14], which may
benefit from the increased gap between elements, allowing
features to be described more discriminatively. For simplicity, we
use $d_i$ to refer to $\hat{d}_i$ in the following
sections where there is no ambiguity. Finally, we summarize the procedure to compute the
adopted distance pattern in Algorithm 1.
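
The normalization in (8) is simple enough to state as a short sketch. The function name is hypothetical, and the handling of a vector whose entries are all equal (returning zeros) is an assumption, since that case is not discussed in the text.

```python
import math

def normalize_distance_vector(d):
    """Subtract the minimum entry, then scale to unit l2-norm."""
    smallest = min(d)
    shifted = [dk - smallest for dk in d]
    norm = math.sqrt(sum(s * s for s in shifted))
    if norm == 0.0:
        # Assumption: all distances equal, so no pattern remains; keep zeros.
        return shifted
    return [s / norm for s in shifted]

# Toy distance vector (hypothetical) with K = 3 classes.
pattern = normalize_distance_vector([4.0, 7.0, 8.0])
```

For this toy input the shifted vector is (0, 3, 4) with norm 5, so the resulting pattern is approximately (0, 0.6, 0.8): the closest class is mapped to zero and the remaining entries keep their relative gaps.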
Here we explore how to utilize the obtained distance vectors
to produce a discriminative and robust image representation.
Different from the previous NBNN-like methods, we
aggregate the obtained distance pattern under the coding-pooling
framework, which provides state-of-the-art performance in
previous works. The overview of the image classification
flowchart is shown in Figure 3. The local features are
transformed into distance vectors one by one; then the distance
vectors and the original local features are separately encoded
and pooled to generate two image representations $v_I^d$ and $v_I$.



Algorithm 1: Distance Pattern

Data: $N$ local features $\{x_i\}_{i=1}^{N}$ of image $I$, the
      class-specific manifolds $M^c$, $c = 1, 2, \ldots, K$.
Result: The desired distance vectors $\hat{d}_i$, $i = 1, 2, \ldots, N$.
for i ← 1; i ≤ N; i ← i + 1 do
    for k ← 1; k ≤ K; k ← k + 1 do
        Calculate $v_i$ using (6), then $d_{i,k} = \|x_i - M^k v_i\|_2^2$.
    Construct the distance vector
    $d_i = [d_{i,1}, d_{i,2}, \ldots, d_{i,K}]^T$.
    Obtain the normalized distance vector $\hat{d}_i$ from (8).

Fig. 3. Overview of the image classification flowchart. This architecture
has been proven to achieve state-of-the-art performance on the basis of a
single type of feature, e.g., LLC [2]. (a) Linear coding and max-pooling
are sequentially performed on original extracted local features, resulting
in an original image representation. (b) All local features are transformed
into distance vectors, on which the linear coding and max-pooling are
sequentially performed. This coding process is called LDC in this paper,
and it results in a distance image representation. Finally, the original image
representation and the distance image representation are simply concatenated
so that they complement each other, where linear SVM is adopted for the
final classification.

Fig. 4. Illustration of the complementarity between image representations
produced by the LLC-like coding methods and our LDC method. In the
coding-pooling framework, the original local feature $x$ is approximated
by the fixed visual words (anchor points) and the corresponding code $v$.
Here we specially suppose the anchor points of all classes to form a
fixed global dictionary $B = [M^1, M^2, \ldots, M^K]$ by concatenating them.
Then the original information of the original feature $x$ can be completely
expressed by the generated codes $v = [v_1, v_2, \ldots, v_K]^T$ and the residue
error $[n_1, n_2, \ldots, n_K]^T$. In fact, the proposed LDC is to utilize the residue
error information by compressing $n_k$ into $d_k$ with $d_k = \|n_k\|^2$. Therefore,
the image representations $v_I$ and $v_I^d$ are complementary to each other due
to their complementary perspectives on utilizing the original information.

Finally, the linear SVM is adopted to classify the images
based on an individual image representation, or on their concatenated
image representation.
To verify the effectiveness and generalization of such distance
transformation, we apply two different coding models
independently, i.e., LLC [2] and Localized Soft-Assignment
coding (LSA) [10], to encode distance vectors, owing to the
high efficiency provided by their approximate fast solutions. We
particularly illustrate this procedure via LLC². Let $B \in \mathbb{R}^{K \times P}$
be the distance dictionary consisting of $P$ distance vectors
$b_1, b_2, \ldots, b_P$, which can be obtained by k-means clustering
of the distance vectors of the training images. For
the input distance vector $d_i$, the corresponding code $y_i$ is
calculated as follows [2]:

$$\min_{y_i} \|d_i - B y_i\|_2^2 + \lambda \|e_i \odot y_i\|_2^2 \tag{9}$$


where max is performed element-wisely for the involved

vectors. In addition, SPM with three levels is adopted for
the spatial pooling. Thus, the distance image representation
vId is equivalently compact, salient, and discriminative as the
original image representation vI .
Here we provide brief analysis on the relationship of the
original image representation vI and the distance image
representation vId . The most intuitive difference is that they
are derived from two different local features: the original
local features {xi } and the distance vector {di }, respectively.
For individual point within images, the coding quantization
on original local features inevitably loses some important
information more or less due to only preserving the principal
information, while the distance vector captures the discriminative information in the residue part and thus compensate the
information loss, as shown in Figure 4. So it is creditable that
the resulting image representations vI and vId are complementary to each other. In practice, we simply concatenate vI and
vId to form a longer vector vIc , which is expected to achieve
better performance. The benefit of such complementarity is
well verified by the following experiments on multiple types
of benchmark datasets.


subject to : 1T yi = 1,


where denotes the element-wise multiplication, 1 is a

P-dimensional all-1 vector, and ei R P is the locality adaptor
that gives different freedom for each visual word proportional
to its similarity to the input distance feature di .
After linear coding on the distance vectors, the max-pooling
is performed on the obtained sparse codes {yi } to produce the
distance image representation vId for image I, namely,
vId = max(y1 , y2 , . . . , y N )
2 The counterpart of LSA refers to [10] for details.
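The LLC coding and max-pooling of the distance vectors described in this section can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: it assumes the approximate fast solution of [2], in which the locality adaptor is replaced by restricting the code to the `k_nn` nearest dictionary atoms and solving the sum-to-one least-squares problem in closed form. The names `llc_encode` and `max_pool`, the row-wise dictionary layout, the regularization constant, and the toy sizes are assumptions made for the example.

```python
import numpy as np


def llc_encode(d, dictionary, k_nn=10, reg=1e-4):
    """Approximate LLC code of one distance vector d of shape (K,).

    dictionary has shape (P, K): one distance-dictionary atom per row
    (the transpose of B in the text). The returned P-dimensional code
    sums to one and is non-zero only on the k_nn nearest atoms.
    """
    dist2 = np.sum((dictionary - d) ** 2, axis=1)
    idx = np.argsort(dist2)[:k_nn]
    shifted = dictionary[idx] - d
    gram = shifted @ shifted.T
    gram += reg * max(np.trace(gram), 1e-12) * np.eye(len(idx))
    w = np.linalg.solve(gram, np.ones(len(idx)))
    code = np.zeros(len(dictionary))
    code[idx] = w / w.sum()
    return code


def max_pool(codes):
    """Element-wise maximum over the codes of all local features."""
    return np.max(codes, axis=0)


rng = np.random.default_rng(0)
B = rng.random((32, 6))    # toy dictionary: P = 32 atoms, K = 6 classes
D = rng.random((50, 6))    # toy image: N = 50 distance vectors
Y = np.array([llc_encode(d, B, k_nn=5) for d in D])
v_d = max_pool(Y)          # distance image representation of the toy image
```

Restricting each code to a few nearest atoms is what keeps the codes sparse, so the element-wise maximum retains the strongest response of every visual word across the image.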


In this section, we evaluate the performance of the proposed method on three groups of benchmark datasets: specific objects (e.g., flower, food), scenes, and general objects. In particular, the specific object datasets include Flower 102 [20] and PFID 61 [21], in which the images are relatively clean without cluttered background. The scene datasets include Scene 15 [7] and Indoor 67 [22], and the general object datasets include Caltech 101 [23] and Caltech 256 [24].
Among the various feature coding models that produce relatively compact image representations, Locality-constrained Linear Coding (LLC) and Localized Soft-Assignment Coding (LSA) almost always achieve state-of-the-art classification performance [2], [10]. In addition, compared with ScSPM and other similar methods, they have much lower computational complexity owing to the existing fast solution [2]. Thus we adopt LLC and LSA individually as the coding model in our method, where max-pooling is always employed. Of course, similar coding models can also be naturally applied to the transformed distance features, e.g., Laplacian Sparse Coding (LScSPM) proposed by Gao et al. [16]. The main target of the following experiments is to verify that the proposed distance pattern uniformly improves classification performance. Moreover, we adopt the best previously reported performance of comparable methods on each dataset, together with the accuracies achieved by LLC and LSA, as the baselines in the performance evaluation. Before reporting the detailed classification results on these datasets, we first give the experimental settings.
A. Experimental Settings
For fair comparison with previously reported results, local features of a single type, dense SIFT [3], are used throughout the experiments. In all of our experiments, SIFT features are extracted at a single scale from densely located patches of gray images. The patches are centered at every 4 pixels and have a fixed size of 16 × 16 pixels, where the VLFeat library [25] is used. Before feature extraction, all images are resized, with the aspect ratio preserved, to no more than 300 × 300 pixels. The anchor points $\{m_i^c\}$ of each class manifold $M^c$ are learned from the training images of that class, and their number is fixed as $K_c = 1024$ for all classes throughout our experiments. For the original dense SIFT features and the corresponding distance vectors, the global dictionaries containing $P$ visual words are learned individually from all training samples via k-means clustering. In particular, $P = 2048$ is fixed for all datasets. Each SIFT feature $x_i$ or distance vector $d_i$ is normalized by its $\ell_2$-norm and then encoded into a $P$-dimensional vector.
An important parameter of LLC and LSA is the number of nearest neighbors $k_{nn}^c$ used in encoding local features. In our method, the distance vector is similarly calculated based on $k_{nn}^d$ neighbors in the specified class manifold. To reduce their influence on classification performance, four different values are used individually for these parameters, i.e., $k_{nn}^d \in \{1, 2, 3, 4\}$, and $k_{nn}^c \in \{2, 5, 10, 20\}$ as suggested in LLC [2]. In the experiments, we report the best result for each method under these parameters, and the influence of these parameters is discussed in the following subsection. In addition, the bandwidth parameter of LSA is fixed as 10, following the authors' setting in [10].
Fig. 5. Example images of the Flower 102 dataset, where each row represents one category. (a) Original images. (b) Corresponding segmented images. Limited by the performance of the segmentation algorithm, the segmented images may contain part of the background, lose part of the object, or even lose the whole object. Image best viewed in color.

In the experiments, SPM is used by hierarchically partitioning each image into 1 × 1, 2 × 2, and 4 × 4 blocks on 3 levels, whose cumulative concatenations are denoted by SPM0, SPM1, and SPM2, respectively. In particular, SPM2 means that all three levels (from 0 to 2) are used by concatenating their pooling vectors. All obtained image-level representations are fed into a linear SVM in the training and testing phases (the LIBLINEAR package [26]), where the penalty parameter of the SVM is fixed as $C = 1$. In fact, we found that the classification performance is quite stable for different penalty parameter values. The number of repetitions and the numbers of training and testing samples follow the configuration provided with each dataset. The performance is measured by the average classification accuracy over all classes. For multiple runs, both the mean and the standard deviation of the classification accuracy are reported.
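The three-level spatial pyramid pooling described above can be illustrated as follows. This is a sketch under assumptions: the codes are given together with the pixel coordinates of their patch centres, each block is max-pooled separately, empty blocks contribute zeros, and the function name `spm_max_pool` is hypothetical. With $P = 2048$, the concatenated SPM2 representation has $(1 + 4 + 16) \times 2048 = 43008$ dimensions.

```python
import numpy as np


def spm_max_pool(codes, xy, width, height, levels=(1, 2, 4)):
    """Max-pool codes inside every block of a spatial pyramid.

    codes: (N, P) codes of the N local features of one image
    xy:    (N, 2) patch-centre coordinates (x, y) in pixels
    Returns a vector of length P * sum(g * g for g in levels).
    """
    n_codes = codes.shape[1]
    pooled = []
    for g in levels:  # g x g blocks on this pyramid level
        # Block index of every feature along each axis, clipped to g - 1.
        col = np.minimum((xy[:, 0] * g / width).astype(int), g - 1)
        row = np.minimum((xy[:, 1] * g / height).astype(int), g - 1)
        for r in range(g):
            for c in range(g):
                inside = (row == r) & (col == c)
                if inside.any():
                    pooled.append(codes[inside].max(axis=0))
                else:  # assumption: an empty block contributes zeros
                    pooled.append(np.zeros(n_codes))
    return np.concatenate(pooled)
```

Passing `levels=(1,)`, `levels=(1, 2)`, or `levels=(1, 2, 4)` corresponds to SPM0, SPM1, and SPM2, respectively.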
As for the evaluation of the proposed methods, we report the results of three different image-level representations: the original feature representation $v_I$, the distance image representation $v_I^d$, and their direct concatenation $v_I^c$. In the experimental results, LLC and LSA are each paired with the different input features. For example, LLC-SIFT refers to applying LLC to the original SIFT features to produce the image-level representation, and LLC-Combine refers to the result of concatenating the image representations from LLC-SIFT and LLC-Distance.
B. Specific Object Datasets
We first evaluate the proposed method on the
Flower 102 [20] and PFID 61 [21] datasets, whose images
are relatively clean and the background is less cluttered.
1) Flower 102: Flower 102 is a 102-category flower dataset [20] containing 8189 images, with each class consisting of 40 to 258 images. Some examples are shown in Figure 5. In particular, the images possess small inter-class differences and large intra-class variance. Here we focus on classifying the segmented images available from the dataset. Owing to the imperfection of the segmentation algorithm, the segmented foreground may contain part of the background or lose part of the object. Therefore, classifying such segmented images is still challenging. The dataset has been divided into a training set, a validation set, and a testing set in the provided protocol. The training set and the validation set each consist of 10 images per class, and the testing set consists of the remaining 6149 images (minimum 20 per class).
2) PFID 61: The Pittsburgh Fast-Food Image Dataset is a collection of fast food images from 13 chain restaurants (e.g., McDonald's, Pizza Hut, KFC) acquired under lab and realistic settings [21]. It contains 61 categories of food items selected from 101 categories. There are 3 instances of each food item, which are bought from different branches and photographed on different days, and 6 images from 6 viewpoints (60 degrees apart) are taken for each food instance. Figure 6 shows 14 of the categories with two example images per category. Notably, the appearance of different instances in each category varies greatly, and some different categories (e.g., hamburgers) are too similar to be distinguished even by the human eye. Such large instance variance and tiny differences between classes make the classification quite challenging.

Fig. 6. Example images of the PFID 61 dataset, where each row of the left and right part represents one category. Each category contains three instances and each instance has six images from different views. Two images of each instance are shown here. Image best viewed in color.

For Flower 102, most of the previous classification methods employing a single feature are based on the $\chi^2$ kernel function of the clustered SIFTint and SIFTbdy features [27]. In stark contrast, we use the much simpler and more efficient linear SVM to classify the segmented images. We directly train the classifier on the training and validation images, as done by the baseline method provided in [20]. Namely, 20 images per class are used for training, and the remaining images are used for testing. For PFID 61, we follow the experimental protocol proposed in previous work [21], [28], and use 3-fold cross-validation to evaluate the performance. In each iteration, the 12 images of two instances are used for training and the 6 images of the third instance are used for testing. We repeat the training and testing process three times, with a different instance serving as the test set each time.
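The PFID 61 protocol above amounts to leave-one-instance-out cross-validation within each category. The following sketch enumerates the folds for a single category; the function name `pfid_folds` and the representation of an image as an `(instance, view)` pair are assumptions made for illustration.

```python
def pfid_folds(n_instances=3, n_views=6):
    """Yield (train, test) lists of (instance, view) pairs for one category.

    Each fold holds out every view of one instance, so the classifier is
    never tested on an instance that it has seen during training.
    """
    for held_out in range(n_instances):
        train = [(i, v) for i in range(n_instances) if i != held_out
                 for v in range(n_views)]
        test = [(held_out, v) for v in range(n_views)]
        yield train, test


folds = list(pfid_folds())
for train, test in folds:
    # Every fold prints: 12 training images, 6 test images
    print(len(train), "training images,", len(test), "test images")
```

Splitting by instance rather than by image prevents views of the same food item from appearing on both sides of the split.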
Table I gives the classification performance of different methods on the Flower 102 and PFID 61 datasets. Here KMTJSRC-CG is the method proposed by Yuan et al. [27], which uses multi-task joint sparse coding and achieves the state-of-the-art performance of 55.20% on Flower 102. As for PFID 61, the state-of-the-art performance is 28.20%, achieved by Yang et al. [28] by utilizing the spatial relationship of local features. Besides these methods, we run the adopted coding methods LLC and LSA on both datasets to demonstrate the effectiveness of the proposed LDC in improving the classification performance.

Table I. Classification accuracy (%) of different methods on Flower 102 and PFID 61.

Method                   Flower 102    PFID 61
SVM (SIFTint) [20]^a
KMTJSRC-CG [27]          55.20
Bag of SIFT [21]^b
OM [28]^c                              28.20
LLC-SIFT                               44.63 ± 4.00
LLC-Distance                           48.45 ± 3.58
LLC-Combine                            48.27 ± 3.59
LSA-SIFT                               43.35 ± 3.36
LSA-Distance                           46.90 ± 3.47
LSA-Combine                            46.54 ± 3.08

^a The best baseline accuracy provided by the authors of Flower 102 for the single feature, which is based on SVM.
^b One of the baseline accuracies on the 61 categories provided by the authors of PFID 61.
^c The Orientation and Midpoint (OM), one of a set of methods based on the statistics of pairwise local features proposed by Yang et al., yields the best accuracy, where the $\chi^2$ kernel is adopted with SVM.

From Table I, it can be observed that the proposed method significantly outperforms LLC and LSA with SIFT features and generally achieves state-of-the-art performance. This verifies that the proposed distance pattern of local features is able to capture the discriminative information among multiple classes more effectively. According to our analysis, the combination of the distance vector and the original SIFT features should yield better classification accuracy than using either of them individually, because the combination is able to compensate for the information loss and provide more useful information. This point is well shown on Flower 102, where the combination achieves the best accuracy of 61.45%. However, the benefit of such combination does not hold on PFID 61, where the individual distance vector, rather than the combination, achieves the best performance of 48.45%. The reason is that different instances of PFID 61 vary too much, and thus the consistency of the local feature distribution between the training images and the testing images is not well guaranteed. This is experimentally demonstrated by the larger accuracy deviations of both the LLC and LSA methods in Table I. In this case, the combination may slightly overfit the training data and lead to a negligible decrease in classification accuracy, e.g., the average accuracy decreases from 48.45% to 48.27% when LLC-Distance is combined with LLC-SIFT.
C. Scene Datasets
Now we evaluate the proposed method on the scene datasets Scene 15 and Indoor 67. Scene recognition is a challenging open problem in high-level vision because each image contains not only characterizing objects that are hard to determine but also a complex background [22]. Compared with object classification, the variations of images in scene classification are more severe, especially in lighting conditions, scale, and spatial layout.
1) Scene 15: This dataset consists of 15 scene categories, among which 8 categories were originally collected by Oliva et al. [29], 5 were added by Li et al. [5], and 2 were adopted from Lazebnik et al. [7]. Each class contains 200 to 400 images, and the average image size is around 300 × 250 pixels. Figure 7 shows some example images of each category.

Fig. 7. Example images of the Scene 15 dataset containing all 15 categories with two images per category.

Table II. Classification accuracy (%) on the two scene datasets Scene 15 and Indoor 67.

Method                          Scene 15        Indoor 67
ROI + gist-annotation [22]^a
Object Bank [30]^b
KSPM [7]                        81.40 ± 0.50
ScSPM [1]                       80.28 ± 0.93
SC + linear kernel [31]^c       84.10 ± 0.50
NBNN [13]^d
LLC-SIFT                        79.81 ± 0.35
LLC-Distance                    80.30 ± 0.62
LLC-Combine                     82.40 ± 0.35
LSA-SIFT                        80.12 ± 0.60
LSA-Distance                    79.73 ± 0.70
LSA-Combine                     82.50 ± 0.47

^a The baseline result provided by the authors of Indoor 67, where region of interest (ROI) detection is employed to reduce the interference of cluttered background and the RBF-kernel SVM is adopted.
^b Object Bank pre-trains one object detector for each class.
^c For comparison, the result of basic features is shown here, but it adopts the intersection kernel rather than our adopted linear SVM.
^d This is the optimized version of NBNN, where the image-to-class distance is learned by employing the Mahalanobis metric.

Fig. 8. Example images of the Indoor 67 dataset containing 67 categories. All categories are organized into five big groups: Store, Home, Public spaces, Leisure, and Working. Four categories with two images per category are shown for each group. Due to the complex background, images within each category vary widely. Image best viewed in color.

2) Indoor 67: This dataset contains 67 indoor scene categories and a total of 15620 images [22]. The images in the dataset were collected from three different sources: online image search tools (Google and Altavista), online photo sharing sites (Flickr), and the LabelMe dataset. All images have a minimum resolution of 200 pixels along the smaller axis. The number of images varies across categories, but there are at least 100 images per category. To make the variety of scene categories easier to see, they are organized into 5 big scene groups (Store, Home, Public spaces, Leisure, and Working places), as shown in Figure 8.
For Scene 15, we follow the setting in [7] and randomly choose 100 images per class for training and test on the rest. In particular, we repeat the evaluation three times and report the average result and the standard deviation. As for Indoor 67, we follow the settings of the baseline method provided in [22]: 80 images of each class are used for training and 20 images for testing, with the partition provided with the dataset.
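The repeated random-split protocol used for Scene 15 can be sketched with the Python standard library. The helper names below are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator is reported.

```python
import random
import statistics


def split_per_class(labels, n_train, rng):
    """Randomly pick n_train indices per class for training; rest for testing."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = rng.sample(indices, len(indices))
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:])
    return train, test


def mean_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    correct, total = {}, {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return statistics.mean(correct[c] / total[c] for c in total)


def summarize(accuracies):
    """Mean and sample standard deviation over repeated runs (assumption)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

A run would call `split_per_class` once per repetition with a fresh `random.Random(seed)`, train and test a classifier on the two index lists, and pass the resulting accuracies to `summarize`.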
Table II provides the classification results on Scene 15 and Indoor 67, together with several baseline results on these two scene datasets. The baseline methods include detection-based methods, linear coding methods, and the NBNN method. For these two datasets, the distance vectors yield classification performance close to that of the original local features, owing to the relatively poor consistency between the feature distributions of the training and testing images. As expected, the combination achieves the best performance for both the LLC and LSA methods, as the spatial robustness of the transformed distance vectors strengthens the robustness of the final combined image-level representation.
D. General Object Datasets
Here we conduct experiments on the Caltech 101 and Caltech 256 datasets, in which each image contains a certain object with cluttered background. The Caltech 101 dataset [23] contains 9144 images in 101 object categories including animals, vehicles, flowers, buildings, etc. The number of images per category varies from 31 to 800. The Caltech 256 dataset [24] contains 30,607 images from 256 object categories, and each category contains at least 80 images. Besides the object categories, each dataset contains an extra background class, i.e., BACKGROUND_Google and clutter, respectively. Figure 9 gives some example images. Compared with Caltech 101, Caltech 256 presents much greater variation in object size, location, pose, etc.
For both datasets, we randomly select 30 images per category for training and test on the rest. In particular, we repeat this three times and report the average classification accuracy and the corresponding standard deviation. Table III provides the resulting classification performance on these two datasets. Here we compare our method mainly with the linear coding methods and the NBNN method. In particular, LLC in [2] adopted three-scale SIFT features, while our work uses only single-scale SIFT features. For Caltech 256, LLC [2] adopted a dictionary of 4096 visual words to further improve the performance, whereas our dictionary size is fixed as 2048. However, even when following the same setting for the Caltech 101 dataset, our own results are slightly worse than those reported in the previous literature. The same holds for LSA. Such a decrease may be caused by implementation details. For fair comparison, here we compare only the results from our own implementation.

Fig. 9. Example images of the Caltech 101 and Caltech 256 datasets containing 102 and 257 categories, respectively. Besides the object categories, each dataset contains one extra background category, namely, BACKGROUND_Google for Caltech 101 and clutter for Caltech 256. All categories in the two datasets have large object variations with cluttered background. Compared with Caltech 101, Caltech 256 has a more irregular object layout, which may degrade the classification performance due to the imperfect matching of spatial pooling. Image best viewed in color.

Table III. Classification accuracy (%) on Caltech 101 and Caltech 256.

Method                       Caltech 101     Caltech 256
SVM-KNN [32]                 66.20 ± 0.50
KSPM [7], [24]               64.60 ± 0.80
ScSPM [1]                    73.20 ± 0.54    34.02 ± 0.35
SC + linear kernel [31]^a    71.50 ± 1.10    35.74 ± 0.10
NBNN [2], [8]^b
LLC [2]^c
LSA [10]                     74.21 ± 0.81
LScSPM [16]
LLC-SIFT                     72.65 ± 0.33    36.27 ± 0.27
LLC-Distance                 73.34 ± 0.95    37.40 ± 0.07
LLC-Combine                  74.59 ± 0.54    38.41 ± 0.11
LSA-SIFT                     72.86 ± 0.33    36.52 ± 0.26
LSA-Distance                 71.45 ± 0.87    36.30 ± 0.06
LSA-Combine                  74.47 ± 0.46    38.25 ± 0.08

^a For fair comparison, the result of basic features with the linear kernel is shown here. Higher accuracy is also reported in [31], but with the intersection kernel.
^b Performance of the original NBNN [8] provided in [2].
^c LLC adopts three-scale SIFT features and a global dictionary of size 4096, which can yield higher accuracy than single-scale features, especially for Caltech 256 with its larger scale variation.

Comparing the results in Table III, we can observe that the combination of the distance vector and the original features always yields better performance than either one individually, as expected. Compared with the previous methods, our method achieves satisfactory performance and outperforms similar methods that use a linear SVM and a single feature. In fact, the classification accuracy could be further increased by adopting an advanced learning-based model [15] or a graph-matching kernel [33], if their additional complexity is disregarded.
From the above experimental results on several different types of image datasets, we can summarize the effectiveness of the proposed method as follows:
1) The distance vectors are quite discriminative under the mild condition that the distributions of the training data and the testing data are consistent to some extent, e.g., the involved images have less interference from cluttered background.
2) The transformation to the distance vector relaxes the requirement for similarity of object spatial layout, because it is independent of the spatial position of distinctive objects. This is one of the critical differences from the original local features.
3) Under the coding-pooling framework, the distance vector and the original feature are complementary to each other. Consequently, their combination can capture the useful classification information more comprehensively and generally achieves higher classification performance, which is uniformly effective on all the datasets used.
E. Discussion
We have proposed the linear distance coding method, and
then verified its effectiveness on multiple types of benchmark
datasets. Here we evaluate the influence of the number of nearest neighbors on calculating distance and coding separately.
Particularly, we select the datasets Flower 102, Indoor 67 and
Caltech 101 with one per type to investigate the performance
under different values, where LLC is particularly employed.
1) Neighbor Number $k_{nn}^d$ on Calculating Distance: In Section III, we introduce the class manifolds to calculate the distance of a local feature to a certain class, with the aim of reducing the complexity and the interference of noisy features. To investigate how $k_{nn}^d$ affects the final classification performance, we provide the average classification accuracy under different values $k_{nn}^d \in \{1, 2, 3, 4\}$, and the plot is shown in Figure 10.

Fig. 10. Classification accuracy of the proposed methods under different $k_{nn}^d \in \{1, 2, 3, 4\}$, where three types of datasets, Flower 102, Indoor 67, and Caltech 101, are adopted. (Vertical axis: classification accuracy; curves: Distance and Combine for each of Caltech 101, Flower 102, and Indoor 67.) Compared to the individual distance vector, the combination is more robust to the parameter $k_{nn}^d$, as it provides more complete information. Image best viewed in color.

From these results, we have the following observations. First, the combined representation is more robust to $k_{nn}^d$ than the individual distance vector, since the combination also encapsulates the information from the original features, which is not affected by this parameter. Second, the influence of this parameter varies a lot across datasets, especially when only the distance vector is adopted. For example, the classification accuracy on Flower 102 keeps increasing when $k_{nn}^d$ increases from 1 to 4. In fact, the performance fluctuates only slightly once the results under $k_{nn}^d = 1$ are discarded. Based on the observations of the influence on different datasets, we suggest $k_{nn}^d = 3$ as a good trade-off.
2) Neighbor Number $k_{nn}^c$ on Coding: Now we investigate the effect of $k_{nn}^c$ on the final classification performance, where $k_{nn}^d = 3$ is universally used for calculating the distance vector. Similarly, we show the varying classification performance under different values in Figure 11. In particular, the results of LLC on the SIFT features are provided besides those of the distance vector and the combination, where four values $k_{nn}^c \in \{2, 5, 10, 20\}$ are explored, as suggested in [2]. For fair comparison, all results here are produced by our own implementation.

Fig. 11. Classification accuracy curves of LLC (Original), LDC (Distance), and their combination (Combine) for different $k_{nn}^c \in \{2, 5, 10, 20\}$, where three types of datasets, Flower 102, Indoor 67, and Caltech 101, are adopted. The three methods show different trends as $k_{nn}^c$ varies. In particular, the combination varies the least, i.e., the combination is considered to be insensitive to the parameter $k_{nn}^c$. Image best viewed in color.

From Figure 11, the optimal parameter of the different methods depends heavily on the characteristics of the involved dataset, e.g., the variations of the images, the degree of background clutter, etc. Here, we summarize the observations of Figure 11 for the different representations individually as follows.
1) SIFT: For the selected three datasets, the optimal parameter is quite different, e.g., $k_{nn}^c = 2$ for Flower 102, while $k_{nn}^c = 5$ for Indoor 67 and Caltech 101. This may be caused by the dependence of the optimal parameter value on the interference of cluttered background. In particular, the images in Flower 102 are all segmented, which significantly reduces the influence of the background, so a small neighborhood is sufficient.
2) Distance: The distance vector possesses different semantics from the original local feature, introduced by our proposed transformation. Compared with SIFT, the performance of the distance vector is relatively stable on different datasets. For example, the optimal accuracy is almost always achieved at $k_{nn}^c = 10$.
3) Combine: By taking advantage of both the stable SIFT and the discriminative Distance, the combination is the most robust to the value of $k_{nn}^c$ across all datasets. For example, it achieves almost the same accuracy on Flower 102 at the different values $k_{nn}^c = 1, 2, 3, 4$.
From the above analysis, the parameter $k_{nn}^c$ is very influential on performance when using the original SIFT features to perform LLC, but such dependence is relaxed for the transformed distance vector. In particular, $k_{nn}^c = 10$ is suggested for both the individual distance vector and the combination in this work.

In this paper, we propose the linear distance coding method to capture the discriminative information of local features and relieve the dependence of spatial pooling on the similarity of object layout across images. Consequently, the proposed method can effectively improve the classification performance, which is verified on various types of datasets. In fact, the distance vector extracts the discriminative information based on the image-to-class distance, which is motivated quite differently from the traditional coding models. The analysis and the experiments show that the distance vector and the original features are complementary to each other. Thus the combination of the two image representations can generally yield higher classification performance.
By comparing the classification results of the proposed method on different types of benchmark datasets, we conclude that cluttered background significantly degrades the final classification performance because of its influence on the salient features of different classes. Inspired by this observation, we plan to design a new model that reduces the interference of background in order to improve the classification performance, e.g., by embedding segmentation results into the classification framework, which forms one of our future directions.
[1] J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid
matching using sparse coding for image classification, in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 17941801.
[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, Localityconstrained linear coding for image classification, in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., Jun. 2010, pp. 33603367.
[3] D. G. Lowe, Distinctive image features from scale-invariant keypoints,
Int. J. Comput. Vis., vol. 60, no. 2, pp. 91110, 2004.
[4] N. Dalal and B. Triggs, Histograms of oriented gradients for human
detection, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1.
Jun. 2005, pp. 886893.
[5] L. Fei-Fei and P. Perona, A Bayesian hierarchical model for learning
natural scene categories, in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., vol. 2. Jun. 2005, pp. 524531.
[6] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, Visual
word ambiguity, IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7,
pp. 12711283, Jul. 2010.
[7] S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial
pyramid matching for recognizing natural scene categories, in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006, pp. 2169


[8] O. Boiman, E. Shechtman, and M. Irani, In defense of nearest-neighbor

based image classification, in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2008, pp. 18.
[9] J. van Gemert, J. Geusebroek, C. Veenman, and A. Smeulders, Kernel
codebooks for scene categorization, in Proc. Eur. Conf. Comput. Vis.,
Oct. 2008, pp. 696709.
[10] L. Liu, L. Wang, and X. Liu, In defense of soft-assignment coding,
in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 24862493.
[11] X. Zhou, K. Yu, T. Zhang, and T. Huang, Image classification using
super-vector coding of local image descriptors, in Proc. Eur. Conf.
Comput. Vis., vol. 5. Sep. 2010, pp. 141154.
[12] R. Behmo, P. Marcombes, A. S. Dalalyan, and V. Prinet, Toward
optimal naive Bayes nearest neighbor, in Proc. Eur. Conf. Comput.
Vis., vol. 4. Sep. 2010, pp. 171184.
[13] Z. Wang, Y. Hu, and L.-T. Chia, Image-to-class distance metric learning
for image classification, in Proc. Eur. Conf. Comput. Vis., vol. 1. Sep.
2010, pp. 706719.
[14] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell, The NBNN kernel,
in Proc. Int. Conf. Comput. Vis., vol. 1. Nov. 2011, pp. 18241831.
[15] J. Feng, B. Ni, Q. Tian, and S. Yan, Geometric p-norm feature pooling
for image classification, in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2011, pp. 26092704.
[16] S. Gao, I. Tsang, L. Chia, and P. Zhao, Local features are not lonely Laplacian sparse coding for image classification, in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., San Francisco, CA, Jun. 2010, pp. 3555
[17] M. Muja and D. G. Lowe, Fast approximate nearest neighbors with
automatic algorithm configuration, in Proc. Int. Joint Conf. Comput.
Vis. Theory Appl., vol. 1. Lisboa, Portugal, Feb. 2009, pp. 331340.
[18] H. Jgou, M. Douze, and C. Schmid, Product quantization for nearest
neighbor search, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1,
pp. 117128, Jan. 2011.
[19] K. Yu and T. Zhang, Improved local coordinate coding using local
tangents, in Proc. Int. Conf. Mach. Learn., Jun. 2010, pp. 12151222.
[20] M.-E. Nilsback and A. Zisserman, Automated flower classification over
a large number of classes, in Proc. Indian Conf. Comput. Vis., Graph.
Image Process., Dec. 2008, pp. 722729.
[21] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang,
PFID: Pittsburgh fast-food image dataset, in Proc. Int. Conf. Image
Process., Nov. 2009, pp. 289292.
[22] A. Quattoni and A. Torralba, Recognizing indoor scenes, in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp.
[23] F.-F. Li, R. Fergus, and P. Perona, "Learning generative visual models
from few training examples: An incremental Bayesian approach tested
on 101 object categories," Comput. Vis. Image Understand., vol. 106,
no. 1, pp. 59-70, 2007.
[24] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category
dataset," Dept. Comput. Sci., California Inst. Technology, Tech. Rep.
7694, Apr. 2007.
[25] A. Vedaldi and B. Fulkerson. (2008). VLfeat: An Open and
Portable Library of Computer Vision Algorithms [Online]. Available:
[26] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, "Liblinear: A library
for large linear classification," J. Mach. Learn. Res., vol. 9,
pp. 1871-1874, May 2008.
[27] X. Yuan and S. Yan, "Visual classification with multi-task joint sparse
representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Jun. 2010, pp. 3493-3500.
[28] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, "Food recognition
using statistics of pairwise local features," in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2010, pp. 2249-2256.
[29] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic
representation of the spatial envelope," Int. J. Comput. Vis., vol. 42,
no. 3, pp. 145-175, 2001.
[30] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li, "Object bank: A high-level
image representation for scene classification & semantic feature
sparsification," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2010, pp.
[31] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level
features for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2010, pp. 2559-2566.


[32] H. Zhang, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative
nearest neighbor classification for visual category recognition,"
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2006,
pp. 2126-2136.
[33] O. Duchenne, A. Joulin, and J. Ponce, "A graph-matching kernel
for object categorization," in Proc. Int. Conf. Comput. Vis., vol. 5.
Barcelona, Spain, Nov. 2011, pp. 1792-1799.

Zilei Wang received the B.S. and Ph.D. degrees in
control theory and control engineering from the University of Science and Technology of China (USTC),
Hefei, China, in 2002 and 2007, respectively.
He is currently an Associate Professor with the
Department of Automation, USTC, and is also
with the Vision and Machine Learning Laboratory,
National University of Singapore, Singapore, as
a Research Fellow. His current research interests
include computer vision and media streaming techniques.

Jiashi Feng received the B.S. degree from the
University of Science and Technology of China,
Hefei, China, in 2007. He is currently pursuing
the Ph.D. degree with the Department of Electrical
and Computer Engineering, National University of
Singapore, Singapore.
His current research interests include computer
vision and machine learning.

Shuicheng Yan (M'06-SM'09) is currently an
Assistant Professor with the Department of Electrical and Computer Engineering, National University of Singapore, where he is the Founding Lead of the Learning and Vision Research
Group. His current research
interests include computer vision, multimedia, and
machine learning. He has authored or co-authored
over 200 technical papers.
He was a recipient of the Best Paper Award from
ICIMCS in 2009, ACMMM in 2010, and ICME in
2010, the Winner Prize of the Classification Task in PASCAL VOC in 2010,
the Honorable Mention Prize of the Detection Task in PASCAL VOC in 2010,
the TCSVT Best Associate Editor (BAE) Award in 2010, and was a co-author
of the papers that received the Best Student Paper Award of PREMIA in 2009 and in 2011.
He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS FOR VIDEO TECHNOLOGY, and a Guest Editor of special
issues for TMM and CVIU.

Hongsheng Xi received the B.S. and M.S. degrees in
applied mathematics from the University of Science
and Technology of China (USTC), Hefei, China, in
1980 and 1985, respectively.
He is currently a Professor with the Department
of Automation, USTC, where he also directs the
Laboratory of Network Communication Systems and
Control. His current research interests include stochastic control systems, network performance analysis and optimization, wireless communications, and
signal processing.