ORIGINAL ARTICLE

Understanding bag-of-words model: a statistical framework
Yin Zhang · Rong Jin · Zhi-Hua Zhou
Received: 27 February 2010 / Accepted: 2 July 2010 / Published online: 28 August 2010
© Springer-Verlag 2010
Abstract The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of the visual words, and then to represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used to generate the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical studies on the properties of the bag-of-words model are almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we develop two algorithms which do not rely on clustering, while achieving competitive performance in object categorization when compared to clustering-based bag-of-words representations.
Keywords Object recognition · Bag-of-words model · Rademacher complexity
1 Introduction
Inspired by the success of text categorization [6,10], the bag-of-words representation has become one of the most popular methods for representing image content and has been successfully applied to object categorization. In a typical bag-of-words representation, "interesting" local patches are first identified from an image, either by dense sampling [14,25] or by an interest point detector [9]. These local patches, represented by vectors in a high-dimensional space [9], are often referred to as the key points.
To efficiently handle these key points, the key idea is to quantize each extracted key point into one of the visual words, and then to represent each image by a histogram of the visual words. This vector quantization procedure, often referred to as the bag-of-words representation, converts the object categorization problem into a text categorization problem. A clustering procedure (e.g., K-means) is often applied to group key points from all the training images into a large number of clusters, with the center of each cluster corresponding to a different visual word. Studies [3,20] have shown promising performance of the bag-of-words representation in object categorization. Various methods [7,8,12,13,17,21,25] have been proposed for visual vocabulary construction to improve both the computational efficiency and the classification accuracy of object categorization. However, to the best of our knowledge, there is no theoretical analysis on the statistical properties of vector quantization for object categorization.
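For concreteness, the following minimal sketch (our illustration, not code from the paper; the data and parameter choices are placeholders, and scikit-learn's KMeans stands in for a generic K-means implementation) builds such a clustering-based bag-of-words histogram:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for SIFT descriptors: one (n_i x 128) array of
# key points per image.
rng = np.random.default_rng(0)
images = [rng.normal(size=(rng.integers(200, 400), 128)) for _ in range(5)]

m = 50  # vocabulary size (number of visual words)
kmeans = KMeans(n_clusters=m, n_init=10, random_state=0)
kmeans.fit(np.vstack(images))  # visual words = cluster centers

def bow_histogram(points):
    """Quantize each key point to its nearest visual word and count."""
    words = kmeans.predict(points)
    return np.bincount(words, minlength=m) / len(points)

histograms = np.array([bow_histogram(x) for x in images])
```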
Y. Zhang · Z.-H. Zhou (✉)
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
e-mail: zhouzh@lamda.nju.edu.cn
Y. Zhang
e-mail: zhangyin@lamda.nju.edu.cn
R. Jin
Department of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
e-mail: rongjin@cse.msu.edu
Int. J. Mach. Learn. & Cyber. (2010) 1:43–52
DOI 10.1007/s13042-010-0001-0
In this paper, we present a statistical framework which generalizes the bag-of-words representation, and aim to provide a theoretical understanding of vector quantization and its effect on object categorization from the viewpoint of statistical consistency. In particular, we view

1. each visual word as a quantization function $f_k(x)$ that is randomly sampled from a class of functions $\mathcal{F}$ according to an unknown distribution $P_{\mathcal{F}}$; and
2. each key point of an image as a random sample from an unknown distribution $q_i(x)$.

The above statistical description of key points and visual words allows us to interpret the similarity between two images in the bag-of-words representation, the key quantity in object categorization, as an empirical expectation over the distributions $q_i(x)$ and $P_{\mathcal{F}}$. Based on the proposed statistical framework, we present two random algorithms for vector quantization, one based on the empirical distribution and the other based on kernel density estimation. We show that both random algorithms for vector quantization are statistically consistent in estimating the similarity between two images. Our empirical study with object recognition also verifies that the two proposed algorithms (i) yield recognition accuracy that is comparable to the clustering-based bag-of-words representation, and (ii) are resilient to the number of visual words when the number of training examples is limited. The success of the two simple algorithms validates the proposed statistical framework for vector quantization.
The rest of this paper is organized as follows. Section 2 presents an overview of existing approaches for key point quantization used in object recognition. Section 3 presents a statistical framework that generalizes the classical bag-of-words representation, together with two random algorithms for vector quantization based on the proposed framework; we show that both algorithms are statistically consistent in estimating the similarity between two images. The empirical study with object recognition reported in Sect. 4 shows encouraging results of the proposed algorithms for vector quantization, which in turn validates the proposed statistical framework for the bag-of-words representation. Section 5 concludes this work.
2 Related work
In object recognition and texture analysis, a number of algorithms have been proposed for key point quantization. Among them, K-means is probably the most popular one. To reduce the high computational cost of K-means, hierarchical K-means is proposed in [13] for more efficient vector quantization. In [25], a supervised learning algorithm is proposed to reduce the visual vocabulary initially obtained by K-means into a more descriptive and compact one. Farquhar et al. [5] model the problem as a Gaussian mixture model, where each visual word corresponds to a Gaussian component, and use the maximum a posteriori (MAP) approach to learn the parameters. A method based on mean-shift is proposed in [7] for vector quantization, to resolve the problem that K-means tends to 'starve' medium-density regions in feature space; each key point is allocated to the first visual word similar to it. Moosmann et al. [12] use extremely randomized clustering forests to efficiently generate a highly discriminative coding of visual words. To minimize the loss of information in vector quantization, Lazebnik and Raginsky [8] seek a compressed representation of vectors that preserves the sufficient statistics of features. In [16], images are characterized using a set of category-specific histograms describing whether the content can best be modeled by the universal vocabulary or a specific vocabulary. Tuytelaars and Schmid [21] propose a quantization method that discretizes a feature space by a regular lattice. van Gemert et al. [22] use kernel density estimation to avoid the problems of 'codeword uncertainty' and 'codeword plausibility'.
Although many studies have shown encouraging results of the bag-of-words representation for object categorization, none of them provides a statistical consistency analysis, which reveals the asymptotic behavior of the bag-of-words model for object recognition. Unlike the existing statistical approaches for key point quantization that are designed to reduce the training error, the proposed framework generalizes the bag-of-words model by the statistical expectation, making it possible to analyze the statistical consistency of the bag-of-words model. Finally, we would like to point out that although several randomized approaches [12,14,24] have been proposed for key point quantization, none of them provides a theoretical analysis of statistical consistency. In contrast, we present not only the theoretical results for the two proposed random algorithms for vector quantization, but also the results of an empirical study with object recognition that support the theoretical claims.
3 A statistical framework for bag-of-words representation
In this section, we first present a statistical framework for
the bag-of-words representation in object categorization,
followed by two random algorithms that are derived from
the proposed framework. The analysis of statistical con-
sistency is also presented for the two proposed algorithms.
3.1 A statistical framework
We consider the bag-of-words representation for images, with each image being represented by a collection of local descriptors. We denote by $N$ the number of training images, and by $X_i = (x_i^1, \ldots, x_i^{n_i})$ the collection of key points used to represent image $I_i$, where $x_i^l \in \mathcal{X}$, $l = 1, \ldots, n_i$, is a key point in the feature space $\mathcal{X}$. To facilitate the statistical analysis, we assume that each key point $x_i^l$ in $X_i$ is randomly drawn from an unknown distribution $q_i(x)$ associated with image $I_i$.

The key idea of the bag-of-words representation is to quantize each key point into one of the visual words that are often derived by clustering. We generalize this idea of quantization by viewing the mapping to a visual word $v_k \in \mathcal{X}$ as a quantization function $f_k(x) : \mathcal{X} \mapsto [0, 1]$. Due to the uncertainty in constructing the vocabulary, we assume that the quantization function $f_k(x)$ is randomly drawn from a class of functions, denoted by $\mathcal{F}$, via an unknown distribution $P_{\mathcal{F}}$. To capture the behavior of quantization, we design the function class $\mathcal{F}$ as follows:

$$\mathcal{F} = \{\, f(x; v) \mid f(x; v) = I(\|x - v\| \le \rho),\; v \in \mathcal{X} \,\} \qquad (1)$$

where the indicator function $I(z)$ outputs 1 when $z$ is true, and 0 otherwise. In the above definition, each quantization function $f(x; v)$ is essentially a ball of radius $\rho$ centered at $v$: it outputs 1 when a point $x$ is within the ball, and 0 if $x$ is outside the ball. This definition of the quantization function is clearly related to vector quantization by data clustering.
Based on the above statistical interpretation of key points and quantization functions, we can now provide a statistical description for the histogram of visual words, which is the key of the bag-of-words representation. Let $\hat{h}_i^k$ denote the normalized number of key points in image $I_i$ that are mapped to visual word $v_k$. Given $m$ visual words, or $m$ quantization functions $\{f_k(x)\}_{k=1}^m$ sampled from $\mathcal{F}$, $\hat{h}_i^k$ is computed as

$$\hat{h}_i^k = \frac{1}{n_i} \sum_{l=1}^{n_i} f_k(x_i^l) = \hat{E}_i[f_k(x)] \qquad (2)$$

where $\hat{E}_i[f_k(x)]$ stands for the empirical expectation of the function $f_k(x)$ based on the samples $x_i^1, \ldots, x_i^{n_i}$. We can generalize the above computation by replacing the empirical expectation $\hat{E}_i[f_k(x)]$ with an expectation over the true distribution $q_i(x)$, i.e.,

$$h_i^k = E_i[f_k(x)] = \int dx\, q_i(x) f_k(x). \qquad (3)$$

The bag-of-words representation for image $I_i$ is expressed by the vector $h_i = (h_i^1, \ldots, h_i^m)$.
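For illustration, the quantization functions of Eq. 1 and the histogram of Eq. 2 can be realized directly; in this sketch (ours, not the authors') the centers are drawn from a placeholder Gaussian standing in for the unknown $P_{\mathcal{F}}$, and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, rho = 128, 50, 8.0  # dimension, #visual words, ball radius (assumed)

# m quantization functions f_k(x; v_k) = I(||x - v_k|| <= rho), Eq. (1);
# the centers v_k play the role of samples from P_F.
centers = rng.normal(size=(m, d))

def empirical_histogram(X, centers, rho):
    """hat h_i^k = (1/n_i) * sum_l I(||x_i^l - v_k|| <= rho), Eq. (2)."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (dists <= rho).mean(axis=0)  # one entry per visual word

X_i = rng.normal(size=(300, d))  # key points of image I_i
h_i = empirical_histogram(X_i, centers, rho)
```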
In the next step, we analyze the pairwise similarity between two images. It is important to note that the pairwise similarity plays a critical role in any pattern classification problem, including object categorization. According to learning theory [18], it is the pairwise similarity, not the vector representation of images, that decides the classification performance. Using the vector representations $h_i$ and $h_j$, the similarity between two images $I_i$ and $I_j$, denoted by $\bar{s}_{ij}$, is computed as

$$\bar{s}_{ij} = \frac{1}{m} h_i^{\top} h_j = \frac{1}{m} \sum_{k=1}^{m} E_i[f_k(x)]\, E_j[f_k(x)] \qquad (4)$$

Similar to the previous analysis, the summation in the above expression can be viewed as an empirical expectation over the sampled quantization functions $f_k(x)$, $k = 1, \ldots, m$. We thus generalize the definition of pairwise similarity in Eq. 4 by replacing the empirical expectation with the true expectation, and obtain the true similarity between two images $I_i$ and $I_j$ as

$$s_{ij} = E_{f \sim P_{\mathcal{F}}}\big[\, E_i[f(x)]\, E_j[f(x)] \,\big] \qquad (5)$$

According to the definition in Eq. 1, each quantization function is parameterized by a center $v$. Thus, to define $P_{\mathcal{F}}$, it suffices to define a distribution for the center $v$, denoted by $q(v)$. Eq. 5 can then be expressed as

$$s_{ij} = E_v\big[\, E_i[f(x)]\, E_j[f(x)] \,\big]. \qquad (6)$$
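The empirical similarity of Eq. 4 is then just a scaled inner product of two such histograms; a small helper (an illustration only) makes this explicit:

```python
import numpy as np

def bow_similarity(h_i, h_j):
    """Empirical similarity of Eq. (4): (1/m) * h_i^T h_j."""
    return float(h_i @ h_j) / len(h_i)

# e.g., with histograms built as in the previous sketch:
# s_bar_ij = bow_similarity(h_i, h_j)
```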
3.2 Random algorithms for key point quantization and their statistical consistency
We emphasize that the pairwise similarity in Eq. 6 cannot be computed directly. This is because both distributions $q_i(x)$ and $q(v)$ are unknown, which makes it intractable to compute $E_i[\cdot]$ and $E_v[\cdot]$. In real applications, approximations are needed. In this section, we study how the approximations affect the estimation of the pairwise similarity. In particular, given the pairwise similarity estimated with different kinds of approximated distributions, we aim to bound its difference from the underlying true similarity. To simplify our analysis, we assume that each image has at least $n$ key points.

By assuming that the key points in all the images are sampled from $q(v)$, we have an empirical distribution for $q(v)$, i.e.,

$$\hat{q}(v) = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} \sum_{l=1}^{n_i} \delta\big(v - x_i^l\big) \qquad (7)$$

where $\delta(x)$ is the Dirac delta function, with $\int \delta(x)\, dx = 1$ and $\delta(x) = 0$ for $x \neq 0$. Direct estimation of the pairwise similarities using the above empirical distribution is computationally expensive, because the number of key points in all images can be very large. In the bag-of-words model, $m$ visual words are used as prototypes for the key points in all the images. Let $v_1, \ldots, v_m$ be the $m$ visual words randomly sampled from the key points in all the images. The empirical distribution $\hat{q}(v)$ is then

$$\hat{q}(v) = \frac{1}{m} \sum_{k=1}^{m} \delta(v - v_k) \qquad (8)$$

In the next step, we aim to approximate the unknown distribution $q_i(x)$ in two different ways, and show the statistical consistency of each approximation.
3.2.1 Empirically estimated density function for $q_i(x)$
First, we approximate $q_i(x)$ by the empirical distribution $\hat{q}_i(x)$ defined as follows:

$$\hat{q}_i(x) = \frac{1}{n_i} \sum_{l=1}^{n_i} \delta\big(x - x_i^l\big) \qquad (9)$$

Given the approximations for the distributions $q_i(x)$ and $q(v)$, we can now compute the approximation of the pairwise similarity $s_{ij}$ defined in Eq. 6. For Eq. 9, the pairwise similarity, denoted by $\hat{s}_{ij}$, is computed as

$$\hat{s}_{ij} = \hat{E}_v\big[\, \hat{E}_i[f(x)]\, \hat{E}_j[f(x)] \,\big] = \frac{1}{m} \sum_{k=1}^{m} \left( \frac{1}{n_i} \sum_{l=1}^{n_i} I\big(\|x_i^l - v_k\| \le \rho\big) \right) \times \left( \frac{1}{n_j} \sum_{l=1}^{n_j} I\big(\|x_j^l - v_k\| \le \rho\big) \right) \qquad (10)$$
To show the statistical consistency of $\hat{s}_{ij}$, we need to bound $|s_{ij} - \hat{s}_{ij}|$. Since there are two approximate distributions used in our estimation, we divide our analysis into two steps. First, we measure $|\bar{s}_{ij} - s_{ij}|$, i.e., the difference in similarity caused by the approximate distribution for $P_{\mathcal{F}}$. Next, we measure $|\hat{s}_{ij} - \bar{s}_{ij}|$, i.e., the difference caused by using the approximate distribution for $q_i(x)$. The overall difference $|s_{ij} - \hat{s}_{ij}|$ is bounded by the sum of the two differences.
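A minimal end-to-end sketch of this estimator (our illustration, later named QEE in Sect. 4, assuming centers drawn uniformly without replacement from the pooled key points as in Eq. 8):

```python
import numpy as np

def qee_similarity(X_i, X_j, pooled, m, rho, rng):
    """Sketch of Eq. (10): sample m centers v_k from the pooled key points
    (the empirical q(v) of Eq. 8), then compare indicator histograms."""
    centers = pooled[rng.choice(len(pooled), size=m, replace=False)]
    h_i = (np.linalg.norm(X_i[:, None] - centers[None], axis=2) <= rho).mean(0)
    h_j = (np.linalg.norm(X_j[:, None] - centers[None], axis=2) <= rho).mean(0)
    return float(h_i @ h_j) / m

rng = np.random.default_rng(0)
X_i, X_j = rng.normal(size=(300, 16)), rng.normal(size=(250, 16))
pooled = np.vstack([X_i, X_j])
print(qee_similarity(X_i, X_j, pooled, m=100, rho=4.0, rng=rng))
```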
We first state the McDiarmid inequality [11], which is used throughout our analysis.

Theorem 1 (McDiarmid Inequality) Given independent random variables $v_1, v_2, \ldots, v_n, v_i' \in V$, and a function $f : V^n \mapsto \mathbb{R}$ satisfying

$$\sup_{v_1, v_2, \ldots, v_n, v_i' \in V} |f(v) - f(v')| \le c_i \qquad (11)$$

where $v = (v_1, v_2, \ldots, v_n)$ and $v' = (v_1, v_2, \ldots, v_{i-1}, v_i', v_{i+1}, \ldots, v_n)$, the following statement holds:

$$\Pr\big( |f(v) - E(f(v))| \ge \epsilon \big) \le 2 \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right) \qquad (12)$$

Using the McDiarmid inequality, we have the following theorem, which bounds $|\bar{s}_{ij} - s_{ij}|$.
Theorem 2 Assume $f_k(x)$, $k = 1, \ldots, m$, are randomly drawn from the class $\mathcal{F}$ according to an unknown distribution, and further assume that any function in $\mathcal{F}$ is universally bounded between 0 and 1. With probability $1 - \delta$, the following inequality holds for any two training images $I_i$ and $I_j$:

$$|\bar{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} \qquad (13)$$

Proof For any $f \in \mathcal{F}$, we have $0 \le E_i[f(x)]\, E_j[f(x)] \le 1$. Thus, for any $k$, $c_k \le 1/m$. By setting

$$\delta = 2\exp(-2m\epsilon^2), \quad \text{or} \quad \epsilon = \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}}, \qquad (14)$$

we have $\Pr(|\bar{s}_{ij} - s_{ij}| \le \epsilon) \ge 1 - \delta$.

The above theorem indicates that, if we have the true distribution $q_i(x)$ of each image $I_i$, then with a large number of sampled quantization functions $f_k(x)$, we have a very good chance of recovering the true similarity $s_{ij}$ with a small error. The next theorem bounds $|\hat{s}_{ij} - s_{ij}|$.
Theorem 3 Assume each image has at least $n$ randomly sampled key points, and that $f_k(x)$, $k = 1, \ldots, m$, are randomly drawn from an unknown distribution over the class $\mathcal{F}$. With probability $1 - \delta$, the following inequality is satisfied for any two images $I_i$ and $I_j$:

$$|\hat{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} + 2\sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}} \qquad (15)$$
Proof We first need to bound the difference between $\hat{E}_i[f_k(x)]$ and $E_i[f_k(x)]$. Since $0 \le f(x) \le 1$ for any $f \in \mathcal{F}$, using the McDiarmid inequality, we have

$$\Pr\big( |\hat{E}_i[f_k(x)] - E_i[f_k(x)]| \ge \epsilon \big) \le 2\exp(-2n\epsilon^2) \qquad (16)$$

By setting

$$2\exp(-2n\epsilon^2) = \frac{\delta}{2m^2}, \quad \text{or} \quad \epsilon = \sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}},$$

with probability $1 - \delta/2$ we have $|\hat{E}_i[f_k(x)] - E_i[f_k(x)]| \le \epsilon$ and $|\hat{E}_j[f_k(x)] - E_j[f_k(x)]| \le \epsilon$ for all $\{f_k(x)\}_{k=1}^{m}$ simultaneously. As a result, with probability $1 - \delta/2$, for any two images $I_i$ and $I_j$, we have
$$|\hat{s}_{ij} - \bar{s}_{ij}| \le \frac{1}{m} \sum_{k=1}^{m} \big|\hat{E}_i^k \hat{E}_j^k - E_i^k E_j^k\big| \le \frac{1}{m} \sum_{k=1}^{m} \Big( \big|(\hat{E}_i^k - E_i^k)\, \hat{E}_j^k\big| + \big|E_i^k\, (\hat{E}_j^k - E_j^k)\big| \Big) \le \frac{1}{m} \sum_{k=1}^{m} \Big( \big|\hat{E}_i^k - E_i^k\big| + \big|\hat{E}_j^k - E_j^k\big| \Big) \le 2\epsilon = 2\sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}} \qquad (17)$$

where $\hat{E}_i^k$ stands for $\hat{E}_i[f_k(x)]$ for simplicity. According to Theorem 2, with probability $1 - \delta/2$, we have

$$|\bar{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} \qquad (18)$$

Combining Eqs. 17 and 18, we have the result in the theorem: with probability $1 - \delta$, the following inequality is satisfied:

$$|\hat{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} + 2\sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}} \qquad (19)$$
Remark Theorem 3 reveals an interesting relationship between the estimation error $|s_{ij} - \hat{s}_{ij}|$ and the number of quantization functions (i.e., the number of visual words). The upper bound in Theorem 3 consists of two terms: the first term decreases at a rate of $O(1/\sqrt{m})$, while the second term increases at a rate of $O(\ln m)$. When the number of visual words $m$ is small, the first term dominates the upper bound, and therefore increasing $m$ will reduce the difference $|\hat{s}_{ij} - s_{ij}|$. As $m$ becomes significantly larger than $n$, the second term will dominate the upper bound, and therefore increasing $m$ will lead to a larger $|\hat{s}_{ij} - s_{ij}|$. This result appears to be consistent with the observations on the size of the visual vocabulary: a large vocabulary tends to perform well in object categorization, but too many visual words can deteriorate the classification accuracy.
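This trade-off can be seen numerically by evaluating the right-hand side of Eq. 15 for a fixed $n$; the values of $n$ and $\delta$ below are arbitrary illustrative choices:

```python
import numpy as np

def theorem3_bound(m, n, delta=0.05):
    """Right-hand side of Eq. (15)."""
    return (np.sqrt(np.log(2 / delta) / (2 * m))
            + 2 * np.sqrt(np.log(4 * m**2 / delta) / (2 * n)))

n = 1000  # roughly the number of key points per image in Sect. 4
for m in (10, 100, 1_000, 10_000, 100_000):
    print(f"m={m:>6}: bound={theorem3_bound(m, n):.4f}")
# The first term shrinks like 1/sqrt(m); the second grows slowly with ln m,
# so the bound first drops and eventually rises as m grows.
```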
Finally, we emphasize that although the idea of vector
quantization by randomly sampled centers was already
discussed in [7,24], to the best of our knowledge, this is the
first work that presents its statistical consistency analysis.
3.2.2 Kernel density function estimation for $q_i(x)$
In this section, we approximate $q_i(x)$ by a kernel density estimate. To this end, we assume that the density function $q_i(x)$ belongs to a family of smooth functions $\mathcal{F}_D$ defined as follows:

$$\mathcal{F}_D = \left\{ q(x) : \mathcal{X} \mapsto \mathbb{R}_+ \,\Big|\, \langle q(x), q(x) \rangle_{\mathcal{H}_\kappa} \le B^2,\ \int q(x)\, dx = 1 \right\} \qquad (20)$$

where $\kappa(x, x') : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_+$ is a local kernel function with $\int \kappa(x, x')\, dx' = 1$, and $B$ controls the functional norm of $q(x)$ in the reproducing kernel Hilbert space $\mathcal{H}_\kappa$. An example of $\kappa(x, x')$ is the RBF function, i.e., $\kappa(x, x') \propto \exp(-\lambda\, d(x, x')^2)$, where $d(x, x') = \|x - x'\|_2$. The distribution $q_i(x)$ is then approximated by a kernel density estimate $\tilde{q}_i(x)$ defined as follows:

$$\tilde{q}_i(x) = \sum_{l=1}^{n_i} \alpha_i^l\, \kappa\big(x, x_i^l\big), \qquad (21)$$
where the $\alpha_i^l$ ($1 \le l \le n_i$) are combination weights that satisfy (i) $\alpha_i^l \ge 0$, (ii) $\sum_{l=1}^{n_i} \alpha_i^l = 1$, and (iii) $\alpha_i^{\top} K_i \alpha_i \le B^2$, where $K_i = [\kappa(x_i^l, x_i^{l'})]_{n_i \times n_i}$.
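With the uniform weights $\alpha_i^l = 1/n_i$ used later in Sect. 4, constraint (iii) reduces to requiring that the mean of the Gram matrix $K_i$ be at most $B^2$; a small sketch (illustrative only, with a Gaussian-density $\kappa$ as an assumption):

```python
import numpy as np

def uniform_weight_norm(X, sigma):
    """alpha^T K alpha for alpha = 1/n: the mean of the Gram matrix
    K = [kappa(x_l, x_l')] with a Gaussian-density kappa (a sketch)."""
    d = X.shape[1]
    sq = np.sum((X[:, None] - X[None]) ** 2, axis=2)
    K = (2 * np.pi * sigma**2) ** (-d / 2) * np.exp(-sq / (2 * sigma**2))
    return float(K.mean())  # must be <= B**2 for q~ to lie in F_D

rng = np.random.default_rng(0)
print(uniform_weight_norm(rng.normal(size=(100, 8)), sigma=1.0))
```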
Using the kernel density estimate, we approximate the pairwise similarity for Eq. 21 as follows:

$$\tilde{s}_{ij} = \hat{E}_v\big[\, \tilde{E}_i[f(x)]\, \tilde{E}_j[f(x)] \,\big] = \frac{1}{m} \sum_{k=1}^{m} \left( \sum_{l=1}^{n_i} \alpha_i^l\, h\big(x_i^l, v_k\big) \right) \left( \sum_{l=1}^{n_j} \alpha_j^l\, h\big(x_j^l, v_k\big) \right) \qquad (22)$$

where the function $h(x, v)$ is defined as

$$h(x, v) = \int dz\, I\big(d(z, v) \le \rho\big)\, \kappa(x, z) \qquad (23)$$
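Because $h(x, v)$ measures how much of the kernel mass centered at $x$ falls inside the ball of radius $\rho$ around $v$, it can be sketched by Monte-Carlo integration; the Gaussian $\kappa$ and all constants below are illustrative assumptions:

```python
import numpy as np

def h_monte_carlo(x, v, rho, sigma, rng, n_samples=20_000):
    """Monte-Carlo sketch of Eq. (23):
    h(x, v) = integral of I(d(z, v) <= rho) * kappa(x, z) over z.
    With kappa a Gaussian density in z centered at x, this is the
    probability that z ~ N(x, sigma^2 I) lands inside the ball around v."""
    z = rng.normal(loc=x, scale=sigma, size=(n_samples, len(x)))
    return float((np.linalg.norm(z - v, axis=1) <= rho).mean())

rng = np.random.default_rng(0)
x, v = np.zeros(8), np.full(8, 0.5)
print(h_monte_carlo(x, v, rho=2.0, sigma=0.5, rng=rng))
```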
To bound the difference between $\tilde{s}_{ij}$ and $s_{ij}$, we follow the analysis of [19] by viewing $E_i[f(x)]\, E_j[f(x)]$ as a mapping, denoted by $g : \mathcal{F} \mapsto \mathbb{R}_+$, i.e.,

$$g(f; q_i, q_j) = E_i[f(x)]\, E_j[f(x)] \qquad (24)$$

The domain for the function $g$, denoted by $\mathcal{G}$, is defined as

$$\mathcal{G} = \left\{ g : \mathcal{F} \mapsto \mathbb{R}_+ \,\Big|\, \exists\, q_i, q_j \in \mathcal{F}_D \ \text{s.t.}\ g(f) = E_i[f(x)]\, E_j[f(x)] \right\} \qquad (25)$$
To bound the complexity of a class of functions, we introduce the concept of Rademacher complexity [2].

Definition 1 (Rademacher complexity) Suppose $x_1, \ldots, x_n$ are sampled i.i.d. from a set $X$. Let $\mathcal{F}$ be a class of functions mapping from $X$ to $\mathbb{R}$. The Rademacher complexity of $\mathcal{F}$ is defined as

$$R_n(\mathcal{F}) = E_{x_1, \ldots, x_n, \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right] \qquad (26)$$

where the $\sigma_i$ are independent uniform $\pm 1$-valued random variables.

Assuming at least $n$ key points are randomly sampled from each image, we have the following lemma, which bounds the complexity of the domain $\mathcal{G}$:
Lemma 1 The Rademacher complexity of the function class $\mathcal{G}$, denoted by $R_m(\mathcal{G})$, is bounded as

$$R_m(\mathcal{G}) \le \frac{2 B C_\kappa\, E_f\big[\int dx\, |f(x)|\big]}{\sqrt{m}} \qquad (27)$$

where $C_\kappa = \max_{x, z} \sqrt{\kappa(x, z)}$.
Proof Denote $F = \{f_1, \ldots, f_m\}$. According to the definition, we have

$$R_m(\mathcal{G}) = E_{\sigma, F}\Big[\sup_{g \in \mathcal{G}} \frac{2}{m} \sum_{k=1}^{m} \sigma_k g(f_k)\Big] = E_F\Big[E_\sigma\Big[\sup_{g \in \mathcal{G}} \frac{2}{m} \sum_{k=1}^{m} \sigma_k g(f_k) \,\Big|\, F\Big]\Big] = E_F\Big[E_\sigma\Big[\sup_{q_i, q_j \in \mathcal{F}_D} \frac{2}{m} \sum_{k=1}^{m} \sigma_k E_i[f_k]\, E_j[f_k] \,\Big|\, F\Big]\Big]$$

$$\le E_F\Big[E_\sigma\Big[\sup_{\|\omega_i\| \le B} \frac{2}{m} \sum_{k=1}^{m} \sigma_k E_i[f_k] \,\Big|\, F\Big]\Big] = \frac{2}{m}\, E_F\Big[E_\sigma\Big[\sup_{\|\omega_i\| \le B} \Big\langle \omega_i,\, \sum_{k=1}^{m} \sigma_k U_k \Big\rangle \,\Big|\, F\Big]\Big]$$

where $U_k = (\langle \phi_1(\cdot), f_k(\cdot)\rangle, \langle \phi_2(\cdot), f_k(\cdot)\rangle, \ldots)$, $\phi_t(x)$ is an eigenfunction of $\kappa(x, x')$, and $\omega_i$ collects the coefficients of $q_i$ in this eigenbasis. Continuing,

$$\le \frac{2B}{m}\, E_F\Big[E_\sigma\Big[\Big\| \sum_{k=1}^{m} \sigma_k U_k \Big\| \,\Big|\, F\Big]\Big] = \frac{2B}{m}\, E_F\Big[E_\sigma\Big[\Big(\sum_{k,t} \sigma_k \sigma_t \langle U_k, U_t\rangle\Big)^{1/2} \,\Big|\, F\Big]\Big] \le \frac{2B}{m}\, E_F\Big[\Big(\sum_{k,t} E_\sigma\big[\sigma_k \sigma_t \langle U_k, U_t\rangle \mid F\big]\Big)^{1/2}\Big]$$

$$= \frac{2B}{m}\, E_F\Big[\Big(\sum_{k} E_\sigma\big[\sigma_k^2 \langle U_k, U_k\rangle \mid F\big]\Big)^{1/2}\Big] = \frac{2B}{m}\, E_F\Big[\Big(\sum_{k} \langle U_k, U_k\rangle\Big)^{1/2}\Big] = \frac{2B}{m}\, E_F\Big[\Big(\sum_{k} \int dz\, dx\, f_k(x) f_k(z) \kappa(x, z)\Big)^{1/2}\Big] \le \frac{2 B C_\kappa\, E_f\big[\int dx\, |f(x)|\big]}{\sqrt{m}} \qquad (28)$$

where the first inequality holds because $E_j[f_k] \le 1$, the second inequality follows from Cauchy's inequality, and the third and fourth inequalities follow from Jensen's inequality. The last step follows from

$$\langle U_k, U_k\rangle = \sum_{t} \langle \phi_t(\cdot), f_k(\cdot)\rangle^2 = \int dz\, dx\, f_k(x) f_k(z) \kappa(x, z) \qquad (29)$$
From [2], we have the following lemmas:

Lemma 2 (Theorem 12 in [2]) For $1 \le q < \infty$, let $\mathcal{L} = \{ |f - h|^q : f \in \mathcal{F} \}$, where $h$ and $\|f - h\|_\infty$ are uniformly bounded. We have

$$R_n(\mathcal{L}) \le 2q\, \|f - h\|_\infty^{q-1} \left( R_n(\mathcal{F}) + \frac{\|h\|_\infty}{\sqrt{n}} \right) \qquad (30)$$

Lemma 3 (Theorem 8 in [2]) With probability $1 - \delta$, the following inequality holds:

$$E[\phi(Y, f(X))] \le \hat{E}_n[\phi(Y, f(X))] + R_n(\tilde{\phi} \circ \mathcal{F}) + \sqrt{\frac{8 \ln(2/\delta)}{n}} \qquad (31)$$

where $\phi(x, y)$ is the loss function, $n$ is the number of samples, and $\tilde{\phi} \circ \mathcal{F} = \{ (x, y) \mapsto \phi(y, f(x)) - \phi(y, 0) : f \in \mathcal{F} \}$.
Based on the above lemmas, we have the following theorem.

Theorem 4 Assume that the density functions $q_i(x), q_j(x) \in \mathcal{F}_D$. Let $\tilde{q}_i(x), \tilde{q}_j(x) \in \mathcal{F}_D$ be the density functions estimated from $n$ sampled key points. Then, with probability $1 - \delta$, the following inequality holds:

$$E_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] \le \hat{E}_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i, \hat{q}_j)|\big] + 2\left( \frac{2 B C_\kappa\, E_f\big[\int dx\, |f(x)|\big]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) + \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\sqrt{\frac{\ln(8m^2/\delta)}{2n}} \qquad (32)$$
Proof From Lemma 3, with probability $1 - \delta/2$, we have

$$E_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] \le \hat{E}_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] + R_m\big(|\mathcal{G} - g(f; q_i, q_j)|\big) + \sqrt{\frac{8 \ln(4/\delta)}{m}} \qquad (33)$$
Since $0 \le g(f; q_i, q_j) \le 1$, using the results in Lemmas 1 and 2, we have

$$R_m\big( |\mathcal{G} - g(f; q_i, q_j)| \big) \le 2\left( R_m(\mathcal{G}) + \frac{1}{\sqrt{m}} \right) \le 2\left( \frac{2 B C_\kappa\, E_f\big[\int dx\, |f(x)|\big]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) \qquad (34)$$
Hence, with probability $1 - \delta/2$, the following inequality holds:

$$E_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] \le \hat{E}_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] + 2\left( \frac{2 B C_\kappa\, E_f\big[\int dx\, |f(x)|\big]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) + \sqrt{\frac{8 \ln(4/\delta)}{m}} \qquad (35)$$
Next, we aim to bound $\hat{E}_f[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|]$. Note that

$$\hat{E}_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] = \frac{1}{m} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; q_i, q_j)| \le \frac{1}{m} \sum_{k=1}^{m} \Big( |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; \hat{q}_i, \hat{q}_j)| + |g(f_k; \hat{q}_i, \hat{q}_j) - g(f_k; q_i, q_j)| \Big) \qquad (36)$$
Using the same logic as in the proof of Theorem 3, we have, with probability $1 - \delta/2$,

$$\frac{1}{m} \sum_{k=1}^{m} |g(f_k; \hat{q}_i, \hat{q}_j) - g(f_k; q_i, q_j)| \le \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\sqrt{\frac{\ln(8m^2/\delta)}{2n}} \qquad (37)$$
From the above results, we have, with probability $1 - \delta/2$, the following inequality:

$$\frac{1}{m} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; q_i, q_j)| \le \frac{1}{m} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; \hat{q}_i, \hat{q}_j)| + \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\sqrt{\frac{\ln(8m^2/\delta)}{2n}} \qquad (38)$$
Combining the above results, we have, with probability $1 - \delta$, the following inequality:

$$E_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] \le \hat{E}_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i, \hat{q}_j)|\big] + 2\left( \frac{2 B C_\kappa\, E_f\big[\int dx\, |f(x)|\big]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) + \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\sqrt{\frac{\ln(8m^2/\delta)}{2n}} \qquad (39)$$
In our empirical study, we use the RBF kernel function for $\kappa(x, x')$ with $\alpha_i^l = 1/n_i$. The corollary below shows the bound for this choice of kernel density estimation.

Corollary 5 When the kernel function is $\kappa(x, x') = \big(1/(2\pi\sigma^2)\big)^{d/2} \exp\big(-\|x - x'\|_2^2/(2\sigma^2)\big)$ and $\alpha_i^l = 1/n_i$, the bound in Theorem 4 becomes

$$E_f\big[|g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)|\big] \le \left(\frac{1}{2\pi\sigma^2}\right)^{d/2} \big( 1 - \exp(-\rho^2/(2\sigma^2)) \big) + 2\, \frac{2 E_f\big[\int dx\, |f(x)|\big]/\sqrt{n_i} + 1}{\sqrt{m}} + \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\sqrt{\frac{\ln(8m^2/\delta)}{2n}} \qquad (40)$$
Remark Theorem 4 bounds the true expectation of the difference between the similarity estimated by the kernel density function and the true similarity. Similar to Theorem 3, this bound also consists of a term decreasing at a rate of $O(1/\sqrt{m})$ and a term increasing at a rate of $O(\ln m)$. Moreover, we can see that in order to minimize the true expectation of this difference, we need to minimize the empirical expectation of the difference between the similarity estimated by the kernel density function and the similarity estimated by the empirical density function. If $\kappa(x, v)$ decreases exponentially as $d(x, v)$ increases, as with the Gaussian kernel, then $h(x, v)$ is close to 1 when $d(x, v) \le \rho$ and close to 0 when $d(x, v) > \rho$. In such circumstances, setting $\alpha_i^l = 1/n_i$ for all $1 \le l \le n_i$ is a good choice for the approximation, and is also very efficient since we do not need to learn $\alpha$.

Note that although the idea of kernel density estimation was already proposed in some studies [e.g., 22], to the best of our knowledge, this is the first work that reveals the statistical consistency of kernel density estimation for the bag-of-words representation.
4 Empirical study
In this empirical study, we aim to verify the proposed framework and the related analysis. To this end, based on the discussion in Sect. 3.2, we present two random algorithms for vector quantization, shown in Algorithm 1. We refer to the algorithm based on the empirical distribution as "Quantization via Empirical Estimation", or QEE for short, and to the algorithm based on kernel density estimation as "Quantization via Kernel Estimation", or QKE for short. Note that since neither vector quantization algorithm relies on a clustering algorithm to identify the visual words, they are in general computationally more efficient. In addition, both algorithms have error bounds that decrease at the rate of $O(1/\sqrt{m})$ when the number of key points $n$ is large, indicating that they are robust to the number of visual words $m$. We emphasize that although similar random algorithms for vector quantization have been discussed in [5,14,17,22,24], the purpose of this empirical study is to verify that

• simple random algorithms deliver performance in object recognition similar to the clustering-based algorithm, and
• the random algorithms are robust to the number of visual words, as predicted by the statistical consistency analysis.
Finally, in the implementation of QKE, to efficiently calculate the $h$-function, we approximate it as [1]

$$h \approx \frac{2(\tilde{d} - \tilde{\rho})^2 - 1}{4\sqrt{\pi}\, (\tilde{d} - \tilde{\rho})^3 \exp\big((\tilde{d} - \tilde{\rho})^2\big)} - \frac{2(\tilde{d} + \tilde{\rho})^2 - 1}{4\sqrt{\pi}\, (\tilde{d} + \tilde{\rho})^3 \exp\big((\tilde{d} + \tilde{\rho})^2\big)} \qquad (41)$$

where $\tilde{d} = d/\sigma$, $\tilde{\rho} = \rho/\sigma$, and $\sigma$ is the width of the Gaussian kernel.
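A direct transcription of this approximation (assuming the reconstruction of Eq. 41 above is faithful; note that the asymptotic expansion degenerates when $d \approx \rho$):

```python
import numpy as np

def h_approx(dist, rho, sigma):
    """Sketch of Eq. (41) with d~ = dist/sigma and rho~ = rho/sigma; valid
    only away from dist ~= rho, where the expansion blows up."""
    dt, rt = dist / sigma, rho / sigma
    term = lambda u: (2 * u**2 - 1) / (4 * np.sqrt(np.pi) * u**3 * np.exp(u**2))
    return term(dt - rt) - term(dt + rt)
```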
Two data sets are used in our study: the PASCAL VOC Challenge 2006 data set [4] and the Graz02 data set [15]. PASCAL06 contains 5,304 images from 10 classes. We randomly select 100 images for training and 500 for testing. The Graz02 data set contains 365 bike images, 420 car images, 311 people images and 380 background images. We randomly select 100 images from each class for training, and use the remaining images for testing. By using a relatively small number of examples for training, we are able to examine the sensitivity of a vector quantization algorithm to the number of visual words. On average, 1,000 key points are extracted from each image, and each key point is represented by the SIFT local descriptor [23]. For the PASCAL06 data set, the binary classification performance for each object class is measured by the area under the ROC curve (AUC). For the Graz02 data set, the binary classification performance for each object class is measured by accuracy. Results averaged over ten random trials are reported.
We compare three vector quantization methods: K-means, QEE and QKE. Note that we do not include more advanced algorithms for vector quantization in our study, because the objective of this study is to validate the proposed statistical framework for the bag-of-words representation and the analysis of statistical consistency. The threshold $\rho$ used by the quantization functions $f(x)$ is set as $\rho = 0.5\,\bar{d}$, where $\bar{d}$ is the average distance between all the key points and the randomly selected centers. An RBF kernel is used in QKE, with the kernel width $\sigma$ set to $0.75\,\bar{d}$ according to our experience. A binary linear SVM is used for each classification problem. To examine the sensitivity to the number of visual words, for both data sets we vary the number of visual words from 10 to 10,000, as shown in Figs. 1 and 2.
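These data-dependent scales can be computed as follows (a sketch; `key_points` and `centers` are hypothetical arrays of descriptors and sampled centers):

```python
import numpy as np

def data_dependent_scales(key_points, centers):
    """Average key-point-to-center distance d_bar, then
    rho = 0.5 * d_bar (ball radius) and sigma = 0.75 * d_bar (kernel width)."""
    dists = np.linalg.norm(key_points[:, None] - centers[None], axis=2)
    d_bar = float(dists.mean())
    return 0.5 * d_bar, 0.75 * d_bar  # rho, sigma
```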
First, we observe that the proposed algorithms for vector quantization yield comparable, if not better, performance than the K-means clustering algorithm. This confirms that the proposed statistical framework for key point quantization is effective. Second, we observe that the clustering-based approach for vector quantization tends to perform worse, sometimes very significantly, when the number of visual words is large. We attribute this instability to the fact that K-means requires each interest point to belong to exactly one visual word. If the number of clusters is not appropriate, for example too large compared to the number of instances, two relevant key points may be separated into different clusters even though they are both very near the boundary. This leads to a poor estimation of the pairwise similarity. The problem of "hard assignment" was also observed in [17,22]. In contrast, for the proposed algorithms, we observe a rather stable improvement as the number of visual words increases, consistent with our statistical consistency analysis.
5 Conclusion
The bag-of-words representation is a popular approach to object categorization. Despite its success, few studies are devoted to the theoretical analysis of the bag-of-words representation. In this work, we present a statistical framework for key point quantization that generalizes the bag-of-words model by statistical expectation. We present two random algorithms for vector quantization in which the visual words are generated by a statistical process rather than by a clustering algorithm. A theoretical analysis of their statistical consistency is presented. We also verify the efficacy and the robustness of the proposed framework by applying it to object recognition. In the future, we plan to examine the dependence of the proposed algorithms on the threshold $\rho$, and to extend QKE to weighted kernel density estimation.
Acknowledgments We want to thank the reviewers for helpful
comments and suggestions. This research is partially supported by the
National Fundamental Research Program of China (2010CB327903),
the Jiangsu 333 High-Level Talent Cultivation Program and the
National Science Foundation (IIS-0643494). Any opinions, findings
and conclusions or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views of the
funding agencies.
References
1. Abramowitz M, Stegun IA (eds) (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, New York
[Fig. 1 Comparison of different quantization methods with varied number of visual words on PASCAL06. Ten panels (bicycle, bus, car, cat, cow, dog, horse, motor, person, sheep) plot AUC against the number of visual words (10 to 10,000) for K-means, QEE and QKE.]

[Fig. 2 Comparison of different quantization methods with varying number of visual words on Graz02: accuracy against the number of visual words (10 to 10,000) for K-means, QEE and QKE.]
2. Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J Mach Learn Res 3:463–482
3. Csurka G, Dance C, Fan L, Williamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: ECCV workshop on statistical learning in computer vision, Prague, Czech Republic
4. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The PASCAL visual object classes challenge 2006 (VOC2006) results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
5. Farquhar J, Szedmak S, Meng H, Shawe-Taylor J (2005) Improving "bag-of-keypoints" image categorisation. Technical report, University of Southampton
6. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, Chemnitz, Germany, pp 137–142
7. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 604–610
8. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebooks by information loss minimization. IEEE Trans Pattern Anal Mach Intell 31(7):1294–1309
9. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
10. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI workshop on learning for text categorization, Madison, WI
11. McDiarmid C (1989) On the method of bounded differences. In: Surveys in combinatorics 1989, pp 148–188
12. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19. MIT Press, Cambridge, pp 985–992
13. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, New York, NY, pp 2161–2168
14. Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 490–503
15. Opelt A, Pinz A, Fussenegger M, Auer P (2006) Generic object recognition with boosting. IEEE Trans Pattern Anal Mach Intell 28(3):416–431
16. Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 464–475
17. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, Anchorage, AK
18. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
19. Shawe-Taylor J, Dolia A (2007) A framework for probability density estimation. In: Proceedings of the 11th international conference on artificial intelligence and statistics, San Juan, Puerto Rico, pp 468–475
20. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE international conference on computer vision, Nice, France, pp 1470–1477
21. Tuytelaars T, Schmid C (2007) Vector quantizing feature space with a regular lattice. In: Proceedings of the 11th IEEE international conference on computer vision, Rio de Janeiro, Brazil, pp 1–8
22. van Gemert JC, Geusebroek J-M, Veenman CJ, Smeulders AWM (2008) Kernel codebooks for scene categorization. In: Proceedings of the 10th European conference on computer vision, Marseille, France, pp 696–709
23. Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org/
24. Viitaniemi V, Laaksonen J (2008) Experiments on selection of codebooks for local image feature histograms. In: Proceedings of the 10th international conference series on visual information systems, Salerno, Italy, pp 126–137
25. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 1800–1807