ABSTRACT
We present a simple variant of the k-d tree which automatically adapts to intrinsic low dimensional structure in data without having to explicitly learn this structure.
1. INTRODUCTION
A k-d tree [4] is a spatial data structure that partitions R^D into hyperrectangular cells. It is built in a recursive manner, splitting along one coordinate direction at a time (Figure 1, left). The succession of splits corresponds to a binary tree whose leaves contain the individual cells in R^D.
These trees are among the most widely-used spatial partitionings in machine learning and statistics. To understand their application, consider Figure 1 (left), and suppose that the dots are points in a database, while the cross is a query point q. The cell containing q, henceforth denoted cell(q), can quickly be identified by moving q down the tree. If the diameter of cell(q) is small (where the diameter is taken to mean the distance between the furthest pair of data points in the cell), then the points in it can be expected to have similar properties, for instance similar labels. In classification, q is assigned the majority label in its cell, or the label of its nearest neighbor in the cell. In regression, q is assigned the average response value in its cell. In vector quantization, q is replaced by the mean of the data points in the cell. Naturally, the statistical theory around k-d trees is centered on the rate at which the diameter of individual cells drops as you move down the tree; for details, see page 320 of [8].

Figure 1: Left: A spatial partitioning of R^2 induced by a k-d tree with three levels. The dots are data points; the cross marks a query point q. Right: Partitioning induced by an RP tree.

It is an empirical observation that the usefulness of k-d trees diminishes as the dimension D increases. This is easy to explain in terms of cell diameter; specifically, we will show that there is a data set in R^D for which a k-d tree requires D levels in order to halve the cell diameter. In other words, if the data lie in R^1000, it could take 1000 levels of the tree to bring the diameter of cells down to half that of the entire data set. This would require 2^1000 data points!

Thus k-d trees are susceptible to the same curse of dimensionality that has been the bane of other nonparametric statistical methods. However, a recent positive development in machine learning has been the realization that a lot of data which superficially lie in a very high-dimensional space R^D actually have low intrinsic dimension, in the sense of lying close to a manifold of dimension d ≪ D. There has been significant interest in algorithms which learn this manifold from data, with the intention that future data can then be transformed into this low-dimensional space, in which standard methods will work well. This field is quite recent and yet the literature on it is already voluminous; early foundational work includes [24, 23, 3].

In this paper, we are interested in techniques that automatically adapt to intrinsic low dimensional structure without having to explicitly learn this structure. The most obvious first question is: do k-d trees adapt to intrinsic low dimension? The answer is no: the bad example mentioned above has an intrinsic dimension of just O(log D). But we introduce a simple variant of k-d trees that does possess this property. Instead of splitting along coordinate directions at the median, we split along a random direction in S^{D-1} (the unit sphere in R^D), and instead of splitting exactly at the median, we add a small amount of jitter. We call these random projection trees (Figure 1, right), or RP trees for short, and we show the following.

    Pick any cell C in the RP tree. If the data in C have intrinsic dimension d, then all descendant cells d log d levels or more below will have at most half the diameter of C.

There is no dependence on the extrinsic dimensionality (D) of the data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
STOC'08, May 17-20, 2008, Victoria, British Columbia, Canada.
Copyright 2008 ACM 978-1-60558-047-0/08/05 ...$5.00.
2. DETAILED OVERVIEW
In what follows, we always assume the data lie in R^D.

2.1 Low-dimensional manifolds
The increasing ubiquity of massive, high-dimensional data sets has focused the attention of the statistics and machine learning communities on the curse of dimensionality. A large part of this effort is based on exploiting the observation that many high-dimensional data sets have low intrinsic dimension. This is a loosely defined notion, which is typically used to mean that the data lie near a smooth low-dimensional manifold.

For instance, suppose that you wish to create realistic animations by collecting human motion data and then fitting models to it. A common method for collecting motion data is to have a person wear a skin-tight suit with high contrast reference points printed on it. Video cameras are used to track the 3D trajectories of the reference points as the person is walking or running. In order to ensure good coverage, a typical suit has about N = 100 reference points. The position and posture of the body at a particular point of time is represented by a (3N)-dimensional vector. However, despite this seeming high dimensionality, the number of degrees of freedom is small, corresponding to the dozen-or-so joint angles in the body. The positions of the reference points are more or less deterministic functions of these joint angles.

To take another example, a speech signal is commonly represented by a high-dimensional time series: the signal is broken into overlapping windows, and a variety of filters are applied within each window. Even richer representations can be obtained by using more filters, or by concatenating vectors corresponding to consecutive windows. Through all this, the intrinsic dimensionality remains small, because the system can be described by a few physical parameters describing the configuration of the speaker's vocal apparatus.

2.2 Intrinsic dimensionality
In this paper we explore three definitions of intrinsic dimension: Assouad dimension, manifold dimension, and local covariance dimension. Assouad (or doubling) dimension appeared in [2].

Definition 1. For any point x ∈ R^D and any r > 0, let B(x, r) = {z : ||x − z|| ≤ r} denote the closed ball of radius r centered at x. The Assouad dimension of S ⊂ R^D is the smallest integer d such that for any ball B(x, r) ⊂ R^D, the set B(x, r) ∩ S can be covered by 2^d balls of radius r/2.

This definition has proved fruitful in recent work on embeddings of metric spaces [2, 18, 17]. To relate it to manifolds, we show (Theorem 22) that the Assouad dimension of a d-dimensional Riemannian submanifold of R^D is O(d), subject to a bound on the second fundamental form of the manifold.

Assouad dimension and manifold dimension have become common currency in the computer science literature. Yet they arose in contexts very different from data analysis, and it is not obvious that they are really the most appropriate quantities for capturing the intrinsic dimensionality of data. It is especially troubling that they seem quite resistant to empirical verification: given a sample of points drawn from an underlying distribution P, it is not easy to check whether P is concentrated near a low-dimensional manifold, or near a set of low Assouad dimension.

To address some of these qualms, we introduce a statistically motivated notion of dimension: we say that a set S has local covariance dimension (d, ε, r) if neighborhoods of radius r have a (1 − ε) fraction of their variance concentrated in a d-dimensional subspace. To make this precise, start by letting σ_1², σ_2², ..., σ_D² denote the eigenvalues of the covariance matrix; these are the variances in each of the eigenvector directions.

Definition 2. Set S ⊂ R^D has local covariance dimension (d, ε, r) if its restriction to any ball of radius r has covariance matrix whose largest d eigenvalues satisfy σ_1² + ··· + σ_d² ≥ (1 − ε) · (σ_1² + ··· + σ_D²).

The intuitions behind this notion have informed some of the work on learning manifolds (for instance, [23]), but here we formalize it for the first time.

2.3 k-d trees and RP trees
Both k-d trees and random projection (RP) trees are built by recursive binary splits. They differ only in the nature of the split, which we define in a subroutine ChooseRule. The core tree-building algorithm is called MakeTree, and takes as input a data set S ⊂ R^D.

    procedure MakeTree(S)
        if |S| < MinSize: return (Leaf)
        Rule ← ChooseRule(S)
        LeftTree ← MakeTree({x ∈ S : Rule(x) = true})
        RightTree ← MakeTree({x ∈ S : Rule(x) = false})
        return ([Rule, LeftTree, RightTree])

The k-d tree ChooseRule picks a coordinate direction (typically the coordinate with largest spread) and then splits the data on its median value for that coordinate.

    procedure ChooseRule(S)
        comment: k-d tree version
        choose a coordinate direction i
        Rule(x) := x_i ≤ median({z_i : z ∈ S})
        return (Rule)

On the other hand, an RPTree chooses a direction uniformly at random from the unit sphere S^{D−1} and splits the data into two roughly equal-sized sets using a hyperplane orthogonal to this direction. We describe two variants, which we call RPTree-Max and RPTree-Mean. Both are adaptive to intrinsic dimension, although the proofs are in different models and use different techniques. We start with the ChooseRule for RPTree-Max.

    procedure ChooseRule(S)
        comment: RPTree-Max version
        choose a random unit direction v ∈ R^D
        pick any x ∈ S; let y ∈ S be the farthest point from it
        choose δ uniformly at random in [−1, 1] · 6||x − y||/√D
        Rule(x) := x · v ≤ (median({z · v : z ∈ S}) + δ)
        return (Rule)

(In this paper, ||·|| always denotes Euclidean distance.) A tree of this kind, with boundaries that are arbitrary hyperplanes, is generically called a binary space partition (BSP) tree [13]. Our particular variant is built using two kinds of randomness, in the split directions as well as in the perturbations. Both are crucial for the bounds we give.
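The MakeTree recursion and the two ChooseRule variants can be sketched directly in Python. This is our own illustrative translation, not the authors' code: the value of MinSize and the handling of degenerate (one-sided) splits are our assumptions.

```python
import numpy as np

MIN_SIZE = 10  # MinSize threshold; the paper leaves its value unspecified

def choose_rule_kd(S, rng):
    # k-d tree version: pick the coordinate with largest spread,
    # split at its median value.
    i = int(np.argmax(S.max(axis=0) - S.min(axis=0)))
    theta = np.median(S[:, i])
    return lambda x: x[i] <= theta

def choose_rule_rp_max(S, rng):
    # RPTree-Max version: random unit direction v, median split
    # jittered by delta drawn from [-1, 1] * 6 * ||x - y|| / sqrt(D).
    D = S.shape[1]
    v = rng.standard_normal(D)
    v /= np.linalg.norm(v)
    x = S[0]                                           # pick any x in S
    y = S[np.argmax(np.linalg.norm(S - x, axis=1))]    # farthest point from x
    delta = rng.uniform(-1.0, 1.0) * 6.0 * np.linalg.norm(x - y) / np.sqrt(D)
    theta = np.median(S @ v) + delta
    return lambda z: z @ v <= theta

def make_tree(S, choose_rule, rng):
    # Recursive binary splitting, as in procedure MakeTree(S).
    if len(S) < MIN_SIZE:
        return ('Leaf', S)
    rule = choose_rule(S, rng)
    mask = np.array([bool(rule(x)) for x in S])
    if mask.all() or not mask.any():  # degenerate split: stop (our addition)
        return ('Leaf', S)
    return (rule,
            make_tree(S[mask], choose_rule, rng),
            make_tree(S[~mask], choose_rule, rng))
```

Passing either choose_rule_kd or choose_rule_rp_max into make_tree yields the corresponding tree; every data point ends up in exactly one leaf.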
The RPTree-Mean is similar to RPTree-Max, but differs in a critical respect: it occasionally performs a different kind of split, in which a cell is split into two pieces based on distance from the mean.

    procedure ChooseRule(S)
        comment: RPTree-Mean version
        if Δ²(S) ≤ c · Δ²_A(S)
        then
            choose a random unit direction v
            Rule(x) := x · v ≤ median({z · v : z ∈ S})
        else
            Rule(x) := ||x − mean(S)|| ≤ median{||z − mean(S)|| : z ∈ S}
        return (Rule)

In the code, c is a constant, Δ(S) is the diameter of S (the distance between the two furthest points in the set), and Δ_A(S) is the average diameter, that is, the average distance between points of S:

    Δ²_A(S) = (1/|S|²) · Σ_{x,y ∈ S} ||x − y||².

2.4 Main results
Suppose an RP tree is built from a data set S ⊂ R^D, not necessarily finite. If the tree has k levels, then it partitions the space into 2^k cells. We define the radius of a cell C ⊂ R^D to be the smallest r > 0 such that S ∩ C ⊂ B(x, r) for some x ∈ C. Our first theorem gives an upper bound on the rate at which the radius of cells in an RPTree-Max decreases as one moves down the tree.

Theorem 3. There is a constant c1 with the following property. Suppose an RPTree-Max is built using data set S ⊂ R^D. Pick any cell C in the RP tree; suppose that S ∩ C has Assouad dimension d. Then with probability at least 1/2 (over the randomization in constructing the subtree rooted at C), for every descendant C' which is more than c1 · d log d levels below C, we have radius(C') ≤ radius(C)/2.

Our next theorem gives a result for the second type of RPTree. In this case, we are able to quantify the improvement per level, rather than amortized over levels. Recall that an RPTree-Mean has two different types of splits; let's call them splits by distance and splits by projection.

Theorem 4. There are constants 0 < c1, c2, c3 < 1 with the following property. Suppose an RPTree-Mean is built using data set S ⊂ R^D. Consider any cell C of radius r, such that S ∩ C has local covariance dimension (d, ε, r), where ε < c1. Pick a point x ∈ S ∩ C at random, and let C' be the cell that contains it at the next level down.
    - If C is split by distance, E[Δ(S ∩ C')] ≤ c2 · Δ(S ∩ C).
    - If C is split by projection, then E[Δ²_A(S ∩ C')] ≤ (1 − (c3/d)) · Δ²_A(S ∩ C).
In both cases, the expectation is over the randomization in splitting C and the choice of x ∈ S ∩ C.

2.5 A lower bound for k-d trees
Finally, we remark that this property of automatically adapting to intrinsic dimension does not hold for k-d trees. The counterexample is very simple, and applies to any variant of k-d trees that uses axis-aligned splits. Consider S ⊂ R^D made up of the coordinate axes between −1 and 1: S = ∪_{i=1}^{D} {t·e_i : −1 ≤ t ≤ 1}. Here e_1, ..., e_D is the canonical basis of R^D. There are many application domains, such as text, in which data is sparse; this example is an extreme case.

S lies within B(0, 1) and can be covered by 2D balls of radius 1/2. It is not hard to see that the Assouad dimension of S is d = log 2D. On the other hand, a k-d tree would clearly need D levels before halving the diameter of its cells. Thus k-d trees cannot be said to adapt to the intrinsic dimensionality of data.

2.6 Connections to other work

Uses of k-d trees
As described in the introduction, the use of k-d trees for classification, regression, near neighbor search, and vector quantization leads to rates of convergence that depend on the rate at which the diameter of cells decreases down the tree. Based on our results, RP trees might considerably extend the scope of these methods, from data that is low dimensional to data that is just intrinsically low dimensional. [12] contains experimental results in this direction.

A related problem is nearest neighbor search, for which k-d trees are commonly used. Here, the criterion governing the efficacy of search is harder to make precise. Interestingly, some state-of-the-art practical work on tree-based nearest neighbor search [21] uses random projection as a preprocessing step. Another notable use of random projections in this context is locality-sensitive hashing [14]. Also relevant is work on other tree structures with complexity guarantees for nearest neighbor search [1, 20, 5]. It would be interesting if similar guarantees could be shown for a data structure as simple as ours.

Vector quantization
Vector quantization [16] is a basic building block of lossy data compression. Here, random vectors X are generated from some distribution P over R^D, and the goal is to pick a finite codebook C ⊂ R^D and an encoding function α : R^D → C such that E_P ||X − α(X)||² is small.

Ideally we'd let α(x) be the nearest neighbor of x in C, but often (in audio or video compression) the number of codewords is so enormous that this nearest neighbor computation cannot be performed in real time. A more efficient scheme is to have the codewords arranged in a tree [7]: there is a partition of space like a k-d tree, and each x is mapped to the mean value in cell(x). Our Theorem 4 shows that the vector quantization error of RP trees behaves like e^{−O(r/d)}, where r is the depth of the tree and d is the intrinsic dimension. There is a substantial body of work that obtains rates for vector quantization, and as one may expect, these turn out to be of the form e^{−r/D} [15].

Compressed sensing
The field of compressed sensing has grown out of the surprising realization that high-dimensional sparse data can be accurately reconstructed from just a few random projections [6, 10]. The central premise of this research area is that the original data thus never even needs to be collected: all one ever sees are the random projections.

RP trees are similar in spirit and entirely compatible with this viewpoint. Theorem 4 holds even if the random pro-
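For concreteness, the RPTree-Mean split rule and the two diameter statistics Δ(S) and Δ_A(S) can be sketched as follows. This is our own sketch; the constant c is left unspecified in the paper, so the value below is a placeholder.

```python
import numpy as np

def diameter_sq(S):
    # Δ²(S): squared distance between the two furthest points (O(n²) brute force).
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.max()

def avg_diameter_sq(S):
    # Δ²_A(S) = (1/|S|²) Σ_{x,y in S} ||x − y||²  (mean over all ordered pairs).
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.mean()

def choose_rule_rp_mean(S, rng, c=100.0):  # c is a placeholder constant
    # Split by projection when the cell is "round" (Δ² <= c·Δ²_A);
    # otherwise split by distance from the mean.
    if diameter_sq(S) <= c * avg_diameter_sq(S):
        v = rng.standard_normal(S.shape[1])
        v /= np.linalg.norm(v)
        theta = np.median(S @ v)
        return lambda x: x @ v <= theta                   # split by projection
    mu = S.mean(axis=0)
    theta = np.median(np.linalg.norm(S - mu, axis=1))
    return lambda x: np.linalg.norm(x - mu) <= theta      # split by distance
```

As a sanity check, Δ²_A(S) always equals twice the average squared distance of the points from their mean (this identity is Corollary 14 in Section 4).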
3.1 Gross statistics of projected data
We choose random projections from R^D to R by picking U ∼ N(0, (1/D)·I_D) (multivariate Gaussian) and mapping x ↦ x · U. The key property of such a projection is that it approximately preserves the lengths of vectors, modulo a scaling factor of √D. This is summarized below.

Lemma 5. Fix any x ∈ R^D. Pick a random vector U ∼ N(0, (1/D)·I_D). Then for any α, β > 0:
    (a) P[ |U · x| ≤ α · ||x||/√D ] ≤ √(2/π) · α; and
    (b) P[ |U · x| ≥ β · ||x||/√D ] ≤ (2/β) · e^{−β²/2}.

[Figure: balls B_i and B_j, with split regions labeled good, bad, and neutral.]
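Since U · x is a one-dimensional Gaussian with standard deviation ||x||/√D, both tails in Lemma 5 can be checked by simulation. The following Monte Carlo sketch is ours, not from the paper; the choices α = 0.5 and β = 3 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 200
x = rng.standard_normal(D)
scale = np.linalg.norm(x) / np.sqrt(D)

# Draw many random projection vectors U ~ N(0, (1/D) I_D).
n_trials = 10000
U = rng.standard_normal((n_trials, D)) / np.sqrt(D)
proj = np.abs(U @ x)

# Part (a): P[|U·x| <= alpha·||x||/sqrt(D)] <= sqrt(2/pi)·alpha, with alpha = 0.5.
frac_small = (proj <= 0.5 * scale).mean()
# Part (b): P[|U·x| >= beta·||x||/sqrt(D)] <= (2/beta)·exp(-beta²/2), with beta = 3.
frac_large = (proj >= 3.0 * scale).mean()
print(frac_small, frac_large)
```

The empirical fractions land comfortably below the two bounds, and the mean of |U · x| sits at the scale ||x||/√D predicted by the lemma.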
The sweet spot is the region between B̃ and B̃'; if the split point falls in it, the two balls will be cleanly separated. By properties (1) and (2), the length of this sweet spot is at least Δ/(4√D) − 2Δ/(16√D) = Δ/(8√D). Moreover, by (3) and (4), we know that its entirety must lie within distance 3Δ/√D of x̃_0 (since both z̃ and z̃' do), and thus within distance 6Δ/√D of median(S̃). Thus, under the sampling strategy from the lemma statement, there is a constant probability of hitting the sweet spot. Putting it all together,

    P[B, B' cleanly separated]
        ≥ P[U is good] · P[B, B' separated | U is good]
        ≥ (1/2) · (Δ/(8√D)) / (12Δ/√D) = 1/192,

as claimed.

We also upper bound the chance of a bad split (Figure 2). For the final qualitative result, all that matters is that this probability be strictly smaller than that of a good split.

Lemma 10. Under the hypotheses of Lemma 9,

    P[B̃, B̃' both intersect the split point] < 1/384.

3.3 Proof of Theorem 3
Finally, we complete the proof of Theorem 3.

Lemma 11. Suppose S ⊂ R^D has Assouad dimension d. Pick any cell C in the RP tree; suppose it is contained in a ball of radius Δ. Then the probability that there exists a descendant of C which is more than c1 · d log d levels below and yet has radius > Δ/2 is at most 1/2.

4. AN RPTREE-MEAN ADAPTS TO LOCAL COVARIANCE DIMENSION
An RPTree-Mean has two types of splits. If a cell C has much larger diameter than average diameter (that is, average interpoint distance), then it is split according to the distances of points from the mean. Otherwise, a random projection is used.

4.1 Splitting by distance from the mean
This is invoked when the points in the current cell, call them S, satisfy Δ²(S) > c·Δ²_A(S) (recall that Δ(S) is the diameter of S while Δ²_A(S) is the average interpoint distance).

Lemma 12. Suppose that Δ²(S) > c·Δ²_A(S). Let S1 be the points in S whose distance to mean(S) is less than or equal to the median distance, and let S2 be the remaining points. Then the expected squared diameter after the split is

    (|S1|/|S|)·Δ²(S1) + (|S2|/|S|)·Δ²(S2) ≤ (1/2 + 2/c)·Δ²(S).

4.2 Splitting by projection: proof outline
Suppose the current cell contains a set of points S ⊂ R^D with Δ²(S) ≤ c·Δ²_A(S). We show that a split by projection has a constant probability of reducing the average squared diameter Δ²_A(S) by Ω(Δ²_A(S)/d). Our proof has three parts:

    I. Suppose S is split into S1 and S2, with means μ1 and μ2. Then the reduction in average diameter can be expressed in a remarkably simple form, as a multiple of ||μ1 − μ2||².

    II. Next, we give a lower bound on the distance between the projected means, (μ̃1 − μ̃2)². We show that the distribution of the projected points is subgaussian with variance O(Δ²_A(S)/D). This well-behavedness implies that (μ̃1 − μ̃2)² = Ω(Δ²_A(S)/D).
    III. We finish by showing that, approximately, ||μ1 − μ2||² ≥ (D/d)·(μ̃1 − μ̃2)². This is because μ1 − μ2 lies close to the subspace spanned by the top d eigenvectors of the covariance matrix of S; and with high probability, every vector in this subspace shrinks by O(√(d/D)) when projected on a random line.

We now tackle these three parts of the proof in order.

4.3 Quantifying the reduction in average diameter
The average squared diameter Δ²_A(S) has certain reformulations that make it convenient to work with. These properties are consequences of the following two observations, the first of which the reader may recognize as a standard bias-variance decomposition of statistics.

Lemma 13. Let X, Y be independent and identically distributed random variables in R^n, and fix any z ∈ R^n.
    (a) E[||X − z||²] = E[||X − EX||²] + ||z − EX||².
    (b) E[||X − Y||²] = 2·E[||X − EX||²].

Proof. Part (a) is immediate when both sides are expanded. For (b), we use part (a) to assert that for any fixed y, we have E[||X − y||²] = E[||X − EX||²] + ||y − EX||². We then take expectation over Y = y.

This can be used to show that the average squared diameter, Δ²_A(S), is twice the average squared distance of points in S from their mean.

Corollary 14. The average squared diameter of a set S can also be written as:

    Δ²_A(S) = (2/|S|) · Σ_{x ∈ S} ||x − mean(S)||².

Proof. Δ²_A(S) is simply E[||X − Y||²], when X, Y are i.i.d. draws from the uniform distribution over S.

At each successive level of the tree, the current cell is split into two, either by a random projection or according to distance from the mean. Suppose the points in the current cell are S, and that they are split into sets S1 and S2. It is obvious that the expected diameter is nonincreasing:

    Δ(S) ≥ (|S1|/|S|)·Δ(S1) + (|S2|/|S|)·Δ(S2).

This is also true of the expected average diameter. In fact, we can precisely characterize how much it decreases on account of the split.

Lemma 15. Suppose set S is partitioned (in any manner) into S1 and S2. Then

    Δ²_A(S) − [ (|S1|/|S|)·Δ²_A(S1) + (|S2|/|S|)·Δ²_A(S2) ]
        = (2·|S1|·|S2| / |S|²) · ||mean(S1) − mean(S2)||².

This completes part I of the proof outline.

4.4 Properties of the projected data
Projection from R^D into R^1 shrinks the average squared diameter of a data set by roughly D. To see this, we start with the fact that when a data set with covariance A is projected onto a vector U, the projected data have variance U^T A U. We then observe that for random U, such quadratic forms are concentrated about their expected values.

Lemma 16. Pick U ∼ N(0, (1/D)·I_D). For any S ⊂ R^D, with probability at least 1/10, the projection of S onto U has average squared diameter Δ²_A(S·U) ≥ Δ²_A(S)/(4D).

Proof. By Corollary 14,

    Δ²_A(S·U) = (2/|S|) · Σ_{x ∈ S} ((x − mean(S)) · U)² = 2·U^T·cov(S)·U,

where cov(S) is the covariance of data set S. This quadratic term has expectation (over the choice of U)

    E[2·U^T·cov(S)·U] = 2·Σ_{i,j} E[U_i U_j]·cov(S)_{ij} = (2/D)·Σ_i cov(S)_{ii} = Δ²_A(S)/D.

Lemma 23(a) then bounds the concentration of U^T·cov(S)·U around its expected value.

Next, we examine the overall distribution of the projected points. When S ⊂ R^D has diameter Δ, its projection into the line can have diameter up to Δ, but as we saw in Lemma 7, most of it will lie within a central interval of size O(Δ/√D). Now we characterize the distribution more precisely.

Lemma 17. Suppose S ⊂ B(0, Δ) ⊂ R^D. Pick δ > 0 and U ∼ N(0, (1/D)·I_D). With probability ≥ 1 − δ, S·U = {x · U : x ∈ S} satisfies the following property for all positive integers k: the fraction of points outside the interval (−kΔ/√D, +kΔ/√D) is at most (2^k/δ)·e^{−k²/2}.

Proof. Apply Lemma 7 for each k (with failure probability δ/2^k) and take a union bound.

Finally, we examine what happens when a d-dimensional linear subspace of R^D is projected into R^1. We show a uniform bound over all vectors in the subspace.

Lemma 18. There exists a constant κ > 0 with the following property. Fix any δ > 0 and any d-dimensional subspace H ⊂ R^D. Pick U ∼ N(0, (1/D)·I_D). Then with probability at least 1 − δ over the choice of U,

    sup_{x ∈ H} ( |x · U|² / ||x||² ) ≤ κ · (d + ln(1/δ)) / D.

Proof. Apply Lemma 6 to the intersection of H with the surface of the unit sphere in R^D. This set has Assouad dimension O(d).

4.5 Distance between projected means
We are dealing with the case when Δ²(S) ≤ c·Δ²_A(S), that is, the diameter of set S is at most a constant factor times the average interpoint distance. If S is projected onto a random direction, the projected points will have variance about Δ²_A(S)/D, by Lemma 16; and by Lemma 17, it isn't too far from the truth to think of these points as having roughly a Gaussian distribution. Thus, if the projected points are split into two groups at the mean, we would expect the means of these two groups to be separated by a distance of about Δ_A(S)/√D. Indeed, this is the case. The same holds if we split at the median, which isn't all that different from the mean for close-to-Gaussian distributions.
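Lemma 15 is an exact identity, so it can be verified numerically on arbitrary data and an arbitrary partition. This check is ours; avg_diameter_sq computes Δ²_A by brute force over all pairs.

```python
import numpy as np

def avg_diameter_sq(S):
    # Δ²_A(S) = (1/|S|²) Σ_{x,y} ||x − y||²
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.mean()

rng = np.random.default_rng(0)
S = rng.standard_normal((50, 6))
mask = rng.random(50) < 0.3         # an arbitrary (non-degenerate) partition of S
S1, S2 = S[mask], S[~mask]
n, n1, n2 = len(S), len(S1), len(S2)

# Left side of Lemma 15: the drop in average squared diameter after the split.
lhs = (avg_diameter_sq(S)
       - (n1 / n) * avg_diameter_sq(S1)
       - (n2 / n) * avg_diameter_sq(S2))
# Right side: (2|S1||S2|/|S|²) · ||mean(S1) − mean(S2)||².
rhs = (2 * n1 * n2 / n**2) * np.sum((S1.mean(0) - S2.mean(0)) ** 2)
assert np.isclose(lhs, rhs)
```

The two sides agree to machine precision for any split, which is exactly what makes the "reduction in average diameter" so convenient to track through the tree.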
Lemma 19. There is a constant c2 for which the following holds. Pick any 0 < δ < 1/16c. Pick U ∼ N(0, (1/D)·I_D) and split S into two pieces:

    S1 = {x ∈ S : x · U < s}   and   S2 = {x ∈ S : x · U ≥ s},

where s is either mean(S·U) or median(S·U). Write p = |S1|/|S|, and let μ̃1 and μ̃2 denote the means of S1·U and S2·U, respectively. Then with probability at least 1/10 − δ,

    (μ̃2 − μ̃1)² ≥ (c2 / (p(1−p))²) · (Δ²_A(S) / D) · (1 / (c · log(1/δ))).

Proof. Let the r.v. X̃ be a uniform-random draw from the projected points S·U. Without loss of generality S·U has mean 0, so E[X̃] = 0 and thus p·μ̃1 + (1−p)·μ̃2 = 0. Rearranging, μ̃1 = −(1−p)(μ̃2 − μ̃1) and μ̃2 = p(μ̃2 − μ̃1).

We already know from Lemma 16 (and Corollary 14) that with probability at least 1/10, the variance of the projected points is significant: var(X̃) ≥ Δ²_A(S)/8D. We'll show this implies a similar lower bound on (μ̃2 − μ̃1)².

Using 1(·) to denote 0-1 indicator variables, for any t > 0,

    var(X̃) ≤ E[(X̃ − s)²]
            ≤ E[ 2t·|X̃ − s| + (|X̃ − s| − t)²·1(|X̃ − s| ≥ t) ].

This is convenient since the linear term gives us μ̃2 − μ̃1:

    E[2t·|X̃ − s|] = 2t·( p(s − μ̃1) + (1−p)(μ̃2 − s) )
                  = 4t·p(1−p)·(μ̃2 − μ̃1) + 2ts(2p − 1).

The last term vanishes since the split is either at the mean of the projected points, in which case s = 0, or at the median, in which case p = 1/2.

Next, we'll choose t = t_0·(Δ(S)/√D)·√(log(1/δ)) for some suitable constant t_0, so that the quadratic term in var(X̃) can be bounded using Lemma 17 and Corollary 8: with probability at least 1 − δ, E[(|X̃| − t)²·1(|X̃| ≥ t)] ≤ δ·(Δ²(S)/D) (a simple integration). Putting things together,

    Δ²_A(S)/(8D) ≤ var(X̃) ≤ 4t·p(1−p)·(μ̃2 − μ̃1) + δ·(Δ²(S)/D).

The result now follows immediately by algebraic manipulation, using the relation Δ²(S) ≤ c·Δ²_A(S).

4.6 Distance between high-dimensional means
Split S into two pieces as in the setting of Lemma 19, and let μ1 and μ2 denote the means of S1 and S2, respectively. We already have a lower bound on the distance between the projected means, μ̃2 − μ̃1; we will now show that ||μ2 − μ1|| is larger than this by a factor of about √(D/d). The main technical difficulty here is the dependence between the μ_i and the projection U. Incidentally, this is the only part of the entire argument that exploits intrinsic dimensionality.

Lemma 20. There is a constant c3 with the following property. Suppose set S ⊂ R^D is such that the top d eigenvalues of cov(S) account for more than 1 − ε of its trace. Pick a random vector U ∼ N(0, (1/D)·I_D), and split S into two pieces, S1 and S2, in any fashion (which may depend upon U). Let p = |S1|/|S|. Let μ1 and μ2 be the means of S1 and S2, and μ̃1 and μ̃2 the means of S1·U and S2·U. Then for any δ > 0, with probability 1 − δ over the choice of U,

    ||μ2 − μ1||² ≥ (c3·D / (d + ln(1/δ))) · ( (μ̃2 − μ̃1)² − (4ε/δ)·(Δ²_A(S) / (p(1−p)·D)) ).

Proof. Assume without loss of generality that S has zero mean. Let H be the subspace spanned by the top d eigenvectors of cov(S), and let H⊥ be its orthogonal subspace. Write any point x ∈ R^D as x_H + x⊥, where each component is a vector in R^D that lies in the respective subspace.

Pick the random vector U; with probability ≥ 1 − δ it satisfies the following two properties.

Property 1: For some constant κ > 0, for every x ∈ R^D,

    |x_H · U|² ≤ κ·((d + ln(1/δ))/D)·||x_H||² ≤ κ·((d + ln(1/δ))/D)·||x||².

This holds (with probability 1 − δ/2) by Lemma 18.

Property 2: Letting X be a uniform-random draw from S,

    E_X[(X⊥ · U)²] ≤ (2/δ)·E_U E_X[(X⊥ · U)²]
                  = (2/δ)·E_X E_U[(X⊥ · U)²]
                  = (2/δ)·E_X[||X⊥||²]/D ≤ (2/δ)·(ε·Δ²_A(S)/D).

The first step is Markov's inequality, and holds with probability 1 − δ/2. The last inequality comes from the local covariance condition.

So assume the two properties hold. Writing μ2 − μ1 as (μ2_H − μ1_H) + (μ2⊥ − μ1⊥),

    (μ̃2 − μ̃1)² = ((μ2_H − μ1_H) · U + (μ2⊥ − μ1⊥) · U)²
                ≤ 2((μ2_H − μ1_H) · U)² + 2((μ2⊥ − μ1⊥) · U)².

The first term can be bounded by Property 1:

    ((μ2_H − μ1_H) · U)² ≤ κ·((d + ln(1/δ))/D)·||μ2 − μ1||².

For the second term, let E_X denote expectation over X chosen uniformly at random from S. Then

    ((μ2⊥ − μ1⊥) · U)² ≤ 2(μ2⊥ · U)² + 2(μ1⊥ · U)²
        = 2(E_X[X⊥ · U | X ∈ S2])² + 2(E_X[X⊥ · U | X ∈ S1])²
        ≤ 2·E_X[(X⊥ · U)² | X ∈ S2] + 2·E_X[(X⊥ · U)² | X ∈ S1]
        ≤ (2/(1−p))·E_X[(X⊥ · U)²] + (2/p)·E_X[(X⊥ · U)²]
        ≤ (2/(p(1−p)))·E_X[(X⊥ · U)²] ≤ (2/(p(1−p)))·(2ε·Δ²_A(S)/(δD))

by Property 2. The lemma follows by putting the various pieces together.

We can now finish off the proof of Theorem 4.

Theorem 21. Fix any ε ≤ O(1/c). Suppose set S ⊂ R^D has the property that the top d eigenvalues of cov(S) account for more than 1 − ε of its trace. Pick a random vector U ∼ N(0, (1/D)·I_D) and split S into two parts,

    S1 = {x ∈ S : x · U < s}   and   S2 = {x ∈ S : x · U ≥ s},

where s is either mean(S·U) or median(S·U). Then with probability Ω(1), the expected average diameter shrinks by Ω(Δ²_A(S)/(cd)).

Proof. By Lemma 15, the reduction in expected average diameter is 2p(1−p)·||μ1 − μ2||², in the language of Lemmas 19 and 20. The rest follows from those lemmas.
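The effect described by Theorem 21 is easy to observe empirically: generate data concentrated near a d-dimensional subspace of R^D, split at the median of a random projection, and compare the drop in Δ²_A against Δ²_A(S)/d. This is an illustrative experiment of ours with arbitrary constants, not a computation from the paper; avg_diameter_sq uses the Corollary 14 identity.

```python
import numpy as np

def avg_diameter_sq(S):
    # Δ²_A(S) = (2/|S|) Σ ||x − mean(S)||²  (Corollary 14)
    return 2.0 * ((S - S.mean(axis=0)) ** 2).sum(axis=1).mean()

rng = np.random.default_rng(0)
n, D, d = 400, 100, 5
# Data near a d-dimensional subspace: unit variance in d coordinates,
# tiny variance in the remaining D − d (so local covariance dimension is small).
S = np.hstack([rng.standard_normal((n, d)),
               0.01 * rng.standard_normal((n, D - d))])

# Split by projection: random unit direction, cut at the projected median.
v = rng.standard_normal(D)
v /= np.linalg.norm(v)
mask = S @ v <= np.median(S @ v)
S1, S2 = S[mask], S[~mask]
n1, n2 = len(S1), len(S2)

before = avg_diameter_sq(S)
after = (n1 / n) * avg_diameter_sq(S1) + (n2 / n) * avg_diameter_sq(S2)
ratio = (before - after) / (before / d)  # reduction in units of Δ²_A(S)/d
print(before, after, ratio)
```

On data like this the reduction is a constant fraction of Δ²_A(S)/d, even though the ambient dimension D is twenty times larger than d.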
Acknowledgements
Dasgupta acknowledges the support of the National Science
Foundation under grants IIS-0347646 and IIS-0713540.
5. REFERENCES
[1] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891-923, 1998.
[2] P. Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429-448, 1983.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[4] J. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In 23rd International Conference on Machine Learning, 2006.
[6] E. Candes and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406-5425, 2006.
[7] P. Chou, T. Lookabaugh, and R. Gray. Optimal pruning with applications to tree-structured source coding and modeling. IEEE Transactions on Information Theory, 35(2):299-315, 1989.
[8] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[9] M. do Carmo. Riemannian Geometry. Birkhauser, 1992.
[10] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[11] R. Durrett. Probability: Theory and Examples. Duxbury, second edition, 1995.
[12] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the structure of manifolds using random projections. In Neural Information Processing Systems, 2007.
[13] H. Fuchs, Z. Kedem, and B. Naylor. On visible surface generation by a priori tree structures. Computer Graphics, 14(3):124-133, 1980.
[14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In 25th International Conference on Very Large Databases, 1999.
[15] S. Graf and H. Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2000.
[16] R. Gray and D. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325-2383, 1998.
[17] A. Gupta, R. Krauthgamer, and J. Lee. Bounded geometries, fractals, and low-distortion embeddings. In IEEE Symposium on Foundations of Computer Science, pages 534-544, 2003.
[18] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, 2001.
[19] P. Indyk and A. Naor. Nearest neighbor preserving embeddings. ACM Transactions on Algorithms, 3(3), 2007.
[20] R. Krauthgamer and J. Lee. Navigating nets: simple algorithms for proximity search. In ACM-SIAM Symposium on Discrete Algorithms, 2004.
[21] T. Liu, A. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Neural Information Processing Systems, 2004.
[22] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete and Computational Geometry, 2006.
[23] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[24] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

Figure 3: Hilbert's space filling curve: a 1D manifold that has Assouad dimension 2, when the radius of balls is larger than the curvature of the manifold.

APPENDIX

A. ASSOUAD DIMENSION OF A SMOOTH MANIFOLD
If M is a d-dimensional Riemannian submanifold of R^D, what is its Assouad dimension? An easy case is when M is an affine set, in which case it has the same Assouad dimension as R^d, namely O(d). We may expect that for more general M, the same holds true of small enough neighborhoods.

Recall that we define balls with respect to Euclidean distance in R^D rather than geodesic distance on M. If a neighborhood M ∩ B(x, r) has high curvature (speaking informally), then it could potentially have large Assouad dimension. For instance, it could be a 1-dimensional manifold and yet curve so much that Ω(2^D) balls of radius r/2 are needed to cover it (Figure 3). We therefore limit attention to manifolds of bounded curvature, and to values of r small enough that the pieces of M in B(x, r) are relatively flat.

To formalize things, we need a handle on how curved the manifold M is locally. This is a relationship between the Riemannian metric on M and that of the space R^D in which it is immersed, and is captured by the second fundamental form (chapter 6 of [9]). For any point p ∈ M, this is a symmetric bilinear form B : T_p × T_p → T_p⊥, where T_p denotes the tangent space at p and T_p⊥ the normal space orthogonal to T_p. Our assumption on curvature is the following.
Assumption. The norm of the second fundamental form is uniformly bounded by some κ ≥ 0; that is, for all p ∈ M and unit norm η ∈ T_p⊥ and u ∈ T_p, we have ⟨η, B(u, u)⟩ / ⟨u, u⟩ ≤ κ.

[Figure: the point x_0 and the centers of balls in the cover; level 0: 1 center, level 1: 2^d centers.]

B. VARIOUS PROOFS

B.1 Proof of Lemma 5
Since U has a Gaussian distribution, and any linear combination of independent Gaussians is a Gaussian, it follows

...where for the last step we observe the relevant factor is at most 3/2 whenever d ≥ 1. So with probability at least 1 − δ, for all k, every edge between levels k and k+1 in the tree has projected length at most (3/4)^k · Δ/√D. Thus every projected point in S̃ has a distance from x̃_0 of at most (Δ/√D)·(1 + 3/4 + (3/4)² + ···) =