… of the data matrix X ∈ R^{n×d}. We assume that the points are zero-centered, i.e., Σ_{i=1}^{n} x_i = 0. Our goal is to learn a binary code matrix B ∈ {−1, 1}^{n×c}, where c denotes the code length. For each bit k = 1, . . . , c, the binary encoding …
that if W is an optimal solution, then so is W̃ = WR for any orthogonal c × c matrix R. Therefore, we are free to orthogonally transform the projected data V = XW in such a way as to minimize the quantization loss

Q(B, R) = ‖B − VR‖²_F ,   (2)

where ‖·‖_F denotes the Frobenius norm.

The idea of rotating the data to minimize quantization loss can be found in Jégou et al. [7]. However, the approach of [7] is based not on binary codes, but on product quantization with asymmetric distance computation (ADC). Unlike in our formulation, direct minimization of quantization loss for ADC is impractical, so Jégou et al. instead suggest solving an easier problem, that of finding a rotation (or, more precisely, an orthogonal transformation) to balance the variance of the different dimensions of the data. In practice, they find that a random rotation works well for this purpose. Based on this observation, a natural baseline for our method is given by initializing R to a random orthogonal matrix.
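To make this baseline concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code; V, R, and random_orthogonal are assumed names) that draws a random orthogonal matrix and evaluates the quantization loss (2) for the corresponding binary assignment:

```python
import numpy as np

def random_orthogonal(c, rng):
    """Sample a c x c orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((c, c)))
    return Q

rng = np.random.default_rng(0)
n, c = 1000, 32                               # toy sizes: n points, c-bit codes
V = rng.standard_normal((n, c))               # stand-in for the projected data V = XW
R = random_orthogonal(c, rng)                 # random rotation (PCA-RR initialization)

B = np.sign(V @ R)                            # nearest binary-hypercube vertex per point
B[B == 0] = 1                                 # break ties at zero
loss = np.linalg.norm(B - V @ R, 'fro') ** 2  # quantization loss, Eq. (2)
```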
Beginning with the random initialization of R, we adopt a k-means-like iterative quantization (ITQ) procedure to find a local minimum of the quantization loss (2). In each iteration, each data point is first assigned to the nearest vertex of the binary hypercube, and then R is updated to minimize the quantization loss given this assignment. These two alternating steps are described in detail below.

Fix R and update B. Expanding (2), we have

Q(B, R) = ‖B‖²_F + ‖V‖²_F − 2 tr(BRᵀVᵀ)
        = nc + ‖V‖²_F − 2 tr(BRᵀVᵀ) .   (3)

Because the projected data matrix V = XW is fixed, minimizing (3) is equivalent to maximizing

tr(BRᵀVᵀ) = Σ_{i=1}^{n} Σ_{j=1}^{c} B_{ij} Ṽ_{ij} ,

where Ṽ_{ij} denote the elements of Ṽ = VR. To maximize this expression with respect to B, we need to have B_{ij} = 1 whenever Ṽ_{ij} ≥ 0 and −1 otherwise. In other words, B = sgn(VR) as claimed in the beginning of this section.

Note that scaling the original data X by a constant factor changes the additive and multiplicative constants in (3), but does not affect the optimal value of B or R. Thus, while our method requires the data to be zero-centered, it does not care at all about the scaling. In other words, the quantization formulation (2) makes sense regardless of whether the average magnitude of the feature vectors is compatible with the radius of the binary cube.

Fix B and update R. For a fixed B, the objective function (2) corresponds to the classic Orthogonal Procrustes problem [15], in which one tries to find a rotation to align one point set with another. In our case, the two point sets are given by the projected data V and the target binary code matrix B. For a fixed B, (2) is minimized as follows: first compute the SVD of the c × c matrix BᵀV as SΩŜᵀ and then let R = ŜSᵀ.

Figure 2. (a) Quantization error for learning a 32-bit ITQ code on the CIFAR dataset (see Section 3.1). (b) Running time for learning different 32-bit encodings on the Tiny Images dataset. The timings were obtained on a workstation with a 2-core Xeon 3.33GHZ CPU and 32G memory.

We alternate between updates to B and R for several iterations to find a locally optimal solution. The typical behavior of the error (2) is shown in Figure 2(a). In practice, we have found that we do not need to iterate until convergence to achieve good performance, and we use 50 iterations for all experiments. Figure 2(b) shows the training time for 32-bit codes for our method and several competing methods that are evaluated in the next section. All the methods scale linearly with the number of images. Although our method is somewhat slower than others, it is still very fast and practical in absolute terms.
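Putting the two steps together, the whole ITQ loop is only a few lines long. The sketch below is our own NumPy illustration of the procedure described above (not the authors' released code); it assumes V is the projected data matrix V = XW of size n × c:

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Iterative quantization: alternately update B and R to locally
    minimize Q(B, R) = ||B - VR||_F^2 (Eq. 2)."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))   # random orthogonal init

    for _ in range(n_iter):
        # Fix R, update B: snap each rotated point to the nearest
        # hypercube vertex, i.e. B = sgn(VR).
        B = np.sign(V @ R)
        B[B == 0] = 1
        # Fix B, update R: Orthogonal Procrustes.  With the SVD
        # B^T V = S Omega S_hat^T, the minimizer is R = S_hat S^T.
        S, _, S_hat_T = np.linalg.svd(B.T @ V)
        R = S_hat_T.T @ S.T
    return B, R
```

Given a projection matrix W of PCA directions, new points could then be encoded as sgn(X_new W R), which corresponds to the PCA-ITQ scheme evaluated in the next section.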
Note that the ITQ algorithm has been inspired by the approach of Yu and Shi [22] for discretizing relaxed solutions to multi-class spectral clustering, which is based on finding an orthogonal transformation of the continuous eigenvectors to bring them as close as possible to a discrete solution. One important difference between [22] and our approach is that [22] allows discretization only to the c orthogonal hypercube vertices with exactly one positive entry, while we use all the 2^c vertices as targets. This enables us to learn efficient codes that preserve the locality structure of the data.

3. Evaluation of Unsupervised Code Learning

3.1. Datasets

We evaluate our method on two subsets of the Tiny Images dataset [17]. Both of these subsets come from [4]. The first subset is a version of the CIFAR dataset [8], and it consists of 64,800 images that have been manually grouped into 11 ground-truth classes (airplane, automobile, bird, boat, cat, deer, dog, frog, horse, ship and truck). The second, larger subset consists of 580,000 Tiny Images.
[Figure 3: (a) mAP and (b) Precision@500 as a function of the number of bits (16–256); legend includes PCA-Nonorth, SKLSH, SH, LSH, PCA-Direct, and the GIST L2 baseline.]

Figure 3. Comparative evaluation on CIFAR dataset. (a) Performance is measured by mean average precision (mAP) for retrieval using top 50 Euclidean neighbors of each query point as true positives. Refer to Figure 4 for the complete recall-precision curves for the state-of-the-art methods. (b) Performance is measured by the averaged precision of top 500 ranked images for each query where ground truth is defined by semantic class labels. Refer to Figure 5 for the complete class label precision curves for the state-of-the-art methods.
Apart from the CIFAR images, which are included among the 580,000 images, all the other images lack manually supplied ground truth labels, but they come associated with one of 388 Internet search keywords. In this section, we use the CIFAR ground-truth labels to evaluate the semantic consistency of our codes, and in Section 4, we will use the "noisy" keyword information associated with the remaining Tiny Images to train a supervised linear embedding.

The original Tiny Images are 32 × 32 pixels. We represent them with grayscale GIST descriptors [11] computed at 8 orientations and 4 different scales, resulting in 320-dimensional feature vectors. Because our method (as well as many other state-of-the-art methods) cannot use more bits than the original dimension of the data, we limit ourselves to evaluating code sizes up to 256 bits.

3.2. Protocols and Baseline Methods

We follow two evaluation protocols widely used in recent papers [12, 19, 21]. The first one is to evaluate performance of nearest neighbor search using Euclidean neighbors as ground truth. As in [12], a nominal threshold of the average distance to the 50th nearest neighbor is used to determine whether a database point returned for a given query is considered a true positive. Then, based on the Euclidean ground truth, we compute the recall-precision curve and the mean average precision (mAP), or the area under the recall-precision curve. Second, we evaluate the semantic consistency of codes produced by different methods by using class labels as ground truth. For this case, we report the averaged precision of top 500 ranked images for each query as in [20]. For all experiments, we randomly select 1000 points to serve as test queries. The remaining images form the training set on which the code parameters are learned, as well as the database against which the queries are performed. All the experiments reported in this paper are averaged over 5 random training/test partitions.
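As a concrete illustration of these two protocols (our own sketch, not part of the paper's released code), the helpers below rank the database by Hamming distance for a single query and then compute the per-query average precision against Euclidean ground truth as well as the class-label precision of the top 500 results; db_codes, query_code, relevant, labels, and query_label are assumed inputs:

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Indices of database items sorted by Hamming distance to the query
    (codes are +/-1 matrices, one row per item)."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind='stable')

def average_precision(ranking, relevant):
    """Area under the recall-precision curve for one query;
    `relevant` is a boolean mask over the database (Euclidean ground truth)."""
    hits = relevant[ranking]
    if not hits.any():
        return 0.0
    prec_at_rank = np.cumsum(hits) / (np.arange(hits.size) + 1)
    return float(prec_at_rank[hits].mean())

def precision_at_k(ranking, labels, query_label, k=500):
    """Fraction of the top-k retrieved images sharing the query's class label."""
    return float(np.mean(labels[ranking[:k]] == query_label))
```

Averaging average_precision over the 1000 test queries would give mAP as defined above, and averaging precision_at_k gives the class-label precision used in the second protocol.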
We compare our ITQ method to three baseline methods that follow the basic hashing scheme H(X) = sgn(XW̃), where the projection matrix W̃ is defined in different ways (a brief code sketch of these three projections is given at the end of this subsection):

1. LSH: W̃ is a Gaussian random matrix [1]. Note that in theory, this scheme has locality preserving guarantees only for unit-norm vectors. While we do not normalize our data to unit norm, we have found that it still works well as long as the data is zero centered.

2. PCA-Direct: W̃ is simply the matrix of top c PCA directions. This baseline is included to show what happens when we do not rotate the PCA-projected data prior to quantization.

3. PCA-RR: W̃ = WR, where W is the matrix of PCA directions and R is a random orthogonal matrix. This is the initialization of ITQ, as described in Section 2.2.

We also compare ITQ to three state-of-the-art methods using code provided by the authors:

1. SH [21]: Spectral Hashing. This method is based on quantizing the values of analytical eigenfunctions computed along PCA directions of the data.

2. SKLSH [12]: This method is based on the random features mapping for approximating shift-invariant kernels [13]. In [12], this method is reported to outperform SH for code sizes larger than 64 bits. We use a Gaussian kernel with bandwidth set to the average distance to the 50th nearest neighbor as in [12].

3. PCA-Nonorth [19]: Non-orthogonal relaxation of PCA. This method is reported in [19] to outperform SH. Note that instead of using semi-supervised PCA as in [19], the evaluation of this section uses standard unsupervised PCA (a supervised embedding will be used in Section 4).

Note that of all the six methods above, LSH and SKLSH are the only ones that rely on randomized data-independent linear projections. All the other methods, including our PCA-RR and PCA-ITQ, use PCA (or a non-orthogonal relaxation of PCA) as an intermediate dimensionality reduction step.
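To make the three baseline projections concrete, here is a small NumPy sketch (our own; function names such as pca_directions and projection_matrix are illustrative, not from any released code) of how each W̃ could be formed from the zero-centered training data X, with the shared hashing rule H(X) = sgn(XW̃):

```python
import numpy as np

def pca_directions(X, c):
    """Top-c principal directions of zero-centered data X, as a d x c matrix W."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are principal directions
    return Vt[:c].T

def projection_matrix(X, c, method, rng):
    """Build the projection W_tilde for the LSH, PCA-Direct, or PCA-RR baseline."""
    d = X.shape[1]
    if method == 'lsh':                  # Gaussian random projection
        return rng.standard_normal((d, c))
    W = pca_directions(X, c)
    if method == 'pca-direct':           # PCA directions used as-is
        return W
    if method == 'pca-rr':               # PCA followed by a random rotation
        R, _ = np.linalg.qr(rng.standard_normal((c, c)))
        return W @ R
    raise ValueError(f'unknown method: {method}')

def hash_codes(X, W_tilde):
    """Basic hashing scheme H(X) = sgn(X W_tilde), with codes in {-1, +1}."""
    B = np.sign(X @ W_tilde)
    B[B == 0] = 1
    return B
```

PCA-ITQ would then correspond to replacing the random rotation in the 'pca-rr' branch with the rotation learned by the itq() sketch given earlier.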
[Figure 4: recall-precision curves for PCA-ITQ, PCA-RR, PCA-Nonorth, SKLSH, and SH at (a) 32, (b) 64, (c) 128, and (d) 256 bits.]
Figure 4. Comparison with state-of-the-art methods on CIFAR dataset using Euclidean neighbors as ground truth. Refer to Figure 3(a) for
a summary of the mean average precision of these curves as a function of code size.
[Figure 5: class label precision as a function of the number of top returned images (up to 500) at (a) 32, (b) 64, (c) 128, and (d) 256 bits.]
Figure 5. Comparison with state-of-the-art methods on CIFAR dataset using semantic labels as ground truth. Figure 3(b) shows the
summary plot of average precision as a function of code size.
3.3. Results on CIFAR Dataset

Figure 3(a) compares all the methods based on their mean average precision for Euclidean neighbor ground truth. Perhaps surprisingly, the natural baseline for our method, PCA-RR, already outperforms everything except PCA-ITQ for most code sizes. The only exception is SKLSH, which has a strongly upward trajectory and gets the best performance for a code size of 256. This behavior may be due to the theoretical convergence guarantee of SKLSH: when enough bits are assigned, Hamming distance between binary codes approximates distance in the kernel space with high quality. LSH, which is data-independent just like SKLSH, also improves as the code size increases, and it almost reaches the performance level of PCA-RR at 256 bits. As for our proposed PCA-ITQ method, it consistently performs better than PCA-RR, although the advantage becomes smaller as the code size increases. Thus, adapting to the data distribution seems especially important when the code size is small. In particular, doing the ITQ refinement for a 64-bit code raises its performance almost to the level of the 256-bit PCA-RR code.

Figure 3(b) evaluates the semantic consistency of the codes using class labels as ground truth. For each method, it shows retrieval precision for the top 500 returned images as a function of code size. It also shows the "upper bound" for the performance of any method, which is given by the retrieval precision of the original uncompressed GIST features with Euclidean distance. As in Figure 3(a), PCA-RR and PCA-ITQ outperform all the other methods, and PCA-ITQ has a small but consistent advantage over PCA-RR. By 256 bits, the precision of PCA-ITQ approaches that of the uncompressed upper bound.

From Figure 3(b), we can also make some interesting observations about the performance of the other methods. Unlike in Figure 3(a), PCA-Direct works relatively well for the smallest code sizes (32 and 64 bits), while SKLSH works surprisingly poorly. This may be due to the fact that unlike most of the other methods, SKLSH does not rely on PCA. Our results seem to indicate that PCA really helps to preserve semantic consistency for the smallest code sizes. Even at 256 bits, while SKLSH had by far the best performance for Euclidean neighbor retrieval, it lags behind most of the other methods in terms of class label precision. This underscores the fact that the best Euclidean neighbors are not necessarily the most semantically consistent, and that it is important to apply dimensionality reduction to the data in order to capture its class structure. Another observation worth making is that the two methods lacking a solid theoretical basis, namely PCA-Direct and SH, can actually decrease in performance as the number of bits increases.

Figures 4 and 5 show complete recall-precision and class label precision curves …
[Figure: precision and mAP plots comparing PCA-ITQ, PCA-RR, PCA-Nonorth, SKLSH, and SH.]
… Images subset comes with noisy keywords that have not been verified by humans. These two different kinds of annotation allow us to explore the power of the CCA embedding given both "clean" and "noisy" supervisory information. For the "clean" scenario, we use a setup analogous to …
Figure 8. Sample top retrieved images for the query in (a) using 32 bits: (a) query; (b) CCA-ITQ, precision 0.89; (c) CCA-RR, precision 0.75; (d) PCA-ITQ, precision 0.58; (e) PCA-RR, precision 0.5. Red rectangle denotes false positive. Best viewed in color.
3 http://www.unc.edu/~yunchao/itq.htm