
Random projection trees and low dimensional manifolds

Sanjoy Dasgupta, UC San Diego, dasgupta@cs.ucsd.edu
Yoav Freund, UC San Diego, yfreund@cs.ucsd.edu

ABSTRACT
We present a simple variant of the k-d tree which automat-
ically adapts to intrinsic low dimensional structure in data
without having to explicitly learn this structure.

1. INTRODUCTION
A k-d tree [4] is a spatial data structure that partitions R^D into hyperrectangular cells. It is built in a recursive manner, splitting along one coordinate direction at a time (Figure 1, left). The succession of splits corresponds to a binary tree whose leaves contain the individual cells in R^D.
Figure 1: Left: A spatial partitioning of R^2 induced by a k-d tree with three levels. The dots are data points; the cross marks a query point q. Right: Partitioning induced by an RP tree.

These trees are among the most widely-used spatial partitionings in machine learning and statistics. To understand their application, consider Figure 1 (left), and suppose that the dots are points in a database, while the cross is a query point q. The cell containing q, henceforth denoted cell(q), can quickly be identified by moving q down the tree. If the diameter of cell(q) is small (where the diameter is taken to mean the distance between the furthest pair of data points in the cell), then the points in it can be expected to have similar properties, for instance similar labels. In classification, q is assigned the majority label in its cell, or the label of its nearest neighbor in the cell. In regression, q is assigned the average response value in its cell. In vector quantization, q is replaced by the mean of the data points in the cell. Naturally, the statistical theory around k-d trees is centered on the rate at which the diameter of individual cells drops as you move down the tree; for details, see page 320 of [8].

It is an empirical observation that the usefulness of k-d trees diminishes as the dimension D increases. This is easy to explain in terms of cell diameter; specifically, we will show that there is a data set in R^D for which a k-d tree requires D levels in order to halve the cell diameter. In other words, if the data lie in R^1000, it could take 1000 levels of the tree to bring the diameter of cells down to half that of the entire data set. This would require 2^1000 data points! Thus k-d trees are susceptible to the same curse of dimensionality that has been the bane of other nonparametric statistical methods. However, a recent positive development in machine learning has been the realization that a lot of data which superficially lie in a very high-dimensional space R^D actually have low intrinsic dimension, in the sense of lying close to a manifold of dimension d << D. There has been significant interest in algorithms which learn this manifold from data, with the intention that future data can then be transformed into this low-dimensional space, in which standard methods will work well. This field is quite recent and yet the literature on it is already voluminous; early foundational work includes [24, 23, 3].

In this paper, we are interested in techniques that automatically adapt to intrinsic low dimensional structure without having to explicitly learn this structure. The most obvious first question is, do k-d trees adapt to intrinsic low dimension? The answer is no: the bad example mentioned above has an intrinsic dimension of just O(log D). But we introduce a simple variant of k-d trees that does possess this property. Instead of splitting along coordinate directions at the median, we split along a random direction in S^{D-1} (the unit sphere in R^D), and instead of splitting exactly at the median, we add a small amount of jitter. We call these random projection trees (Figure 1, right), or RP trees for short, and we show the following.
    Pick any cell C in the RP tree. If the data in C have intrinsic dimension d, then all descendant cells O(d log d) levels below will have at most half the diameter of C.

There is no dependence on the extrinsic dimensionality (D) of the data.
2. DETAILED OVERVIEW
In what follows, we always assume the data lie in R^D.

2.1 Low-dimensional manifolds
The increasing ubiquity of massive, high-dimensional data sets has focused the attention of the statistics and machine learning communities on the curse of dimensionality. A large part of this effort is based on exploiting the observation that many high-dimensional data sets have low intrinsic dimension. This is a loosely defined notion, which is typically used to mean that the data lie near a smooth low-dimensional manifold.
For instance, suppose that you wish to create realistic animations by collecting human motion data and then fitting models to it. A common method for collecting motion data is to have a person wear a skin-tight suit with high contrast reference points printed on it. Video cameras are used to track the 3D trajectories of the reference points as the person is walking or running. In order to ensure good coverage, a typical suit has about N = 100 reference points. The position and posture of the body at a particular point of time is represented by a (3N)-dimensional vector. However, despite this seeming high dimensionality, the number of degrees of freedom is small, corresponding to the dozen-or-so joint angles in the body. The positions of the reference points are more or less deterministic functions of these joint angles.
To take another example, a speech signal is commonly represented by a high-dimensional time series: the signal is broken into overlapping windows, and a variety of filters are applied within each window. Even richer representations can be obtained by using more filters, or by concatenating vectors corresponding to consecutive windows. Through all this, the intrinsic dimensionality remains small, because the system can be described by a few physical parameters describing the configuration of the speaker's vocal apparatus.

2.2 Intrinsic dimensionality
In this paper we explore three definitions of intrinsic dimension: Assouad dimension, manifold dimension, and local covariance dimension.
Assouad (or doubling) dimension appeared in [2].

  Definition 1. For any point x ∈ R^D and any r > 0, let B(x, r) = {z : ‖x − z‖ ≤ r} denote the closed ball of radius r centered at x. The Assouad dimension of S ⊂ R^D is the smallest integer d such that for any ball B(x, r) ⊂ R^D, the set B(x, r) ∩ S can be covered by 2^d balls of radius r/2.

This definition has proved fruitful in recent work on embeddings of metric spaces [2, 18, 17]. To relate it to manifolds, we show (Theorem 22) that the Assouad dimension of a d-dimensional Riemannian submanifold of R^D is O(d), subject to a bound on the second fundamental form of the manifold.
Assouad dimension and manifold dimension have become common currency in the computer science literature. Yet they arose in contexts very different from data analysis, and it is not obvious that they are really the most appropriate quantities for capturing the intrinsic dimensionality of data. It is especially troubling that they seem quite resistant to empirical verification: given a sample of points drawn from an underlying distribution P, it is not easy to check whether P is concentrated near a low-dimensional manifold, or near a set of low Assouad dimension.
To address some of these qualms, we introduce a statistically motivated notion of dimension: we say that a set S has local covariance dimension (d, ε, r) if neighborhoods of radius r have a (1 − ε) fraction of their variance concentrated in a d-dimensional subspace. To make this precise, start by letting σ_1², σ_2², . . . , σ_D² denote the eigenvalues of the covariance matrix; these are the variances in each of the eigenvector directions.

  Definition 2. Set S ⊂ R^D has local covariance dimension (d, ε, r) if its restriction to any ball of radius r has covariance matrix whose largest d eigenvalues satisfy σ_1² + · · · + σ_d² ≥ (1 − ε) · (σ_1² + · · · + σ_D²).

The intuitions behind this notion have informed some of the work on learning manifolds (for instance, [23]), but here we formalize it for the first time.

2.3 k-d trees and RP trees
Both k-d trees and random projection (RP) trees are built by recursive binary splits. They differ only in the nature of the split, which we define in a subroutine ChooseRule. The core tree-building algorithm is called MakeTree, and takes as input a data set S ⊂ R^D.

    procedure MakeTree(S)
      if |S| < MinSize return (Leaf)
      Rule <- ChooseRule(S)
      LeftTree <- MakeTree({x ∈ S : Rule(x) = true})
      RightTree <- MakeTree({x ∈ S : Rule(x) = false})
      return ([Rule, LeftTree, RightTree])

The k-d tree ChooseRule picks a coordinate direction (typically the coordinate with largest spread) and then splits the data on its median value for that coordinate.

    procedure ChooseRule(S)
      comment: k-d tree version
      choose a coordinate direction i
      Rule(x) := x_i ≤ median({z_i : z ∈ S})
      return (Rule)

On the other hand, an RPTree chooses a direction uniformly at random from the unit sphere S^{D-1} and splits the data into two roughly equal-sized sets using a hyperplane orthogonal to this direction. We describe two variants, which we call RPTree-Max and RPTree-Mean. Both are adaptive to intrinsic dimension, although the proofs are in different models and use different techniques.
We start with the ChooseRule for RPTree-Max.

    procedure ChooseRule(S)
      comment: RPTree-Max version
      choose a random unit direction v ∈ R^D
      pick any x ∈ S; let y ∈ S be the farthest point from it
      choose δ uniformly at random in [−1, 1] · 6‖x − y‖/√D
      Rule(x) := x · v ≤ (median({z · v : z ∈ S}) + δ)
      return (Rule)

(In this paper, ‖·‖ always denotes Euclidean distance.) A tree of this kind, with boundaries that are arbitrary hyperplanes, is generically called a binary space partition (BSP) tree [13]. Our particular variant is built using two kinds of randomness, in the split directions as well as in the perturbations. Both are crucial for the bounds we give.
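To make the pseudocode above concrete, the following is a minimal Python/NumPy sketch of MakeTree with the two split rules; it is an illustration of ours, not the authors' code, and the helper names (make_tree, kd_rule, rp_max_rule, min_size) are assumptions of this sketch.

    import numpy as np

    def kd_rule(S, rng):
        # k-d tree version: coordinate of largest spread, split at its median.
        i = int(np.argmax(S.max(axis=0) - S.min(axis=0)))
        t = np.median(S[:, i])
        return lambda z: z[i] <= t

    def rp_max_rule(S, rng):
        # RPTree-Max version: random unit direction, median split with jitter.
        D = S.shape[1]
        v = rng.standard_normal(D); v /= np.linalg.norm(v)
        x = S[rng.integers(len(S))]                           # any point x in S
        y = S[np.argmax(np.linalg.norm(S - x, axis=1))]       # farthest point from x
        delta = rng.uniform(-1, 1) * 6 * np.linalg.norm(x - y) / np.sqrt(D)
        t = np.median(S @ v) + delta
        return lambda z: z @ v <= t

    def make_tree(S, choose_rule, min_size=20, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        if len(S) < min_size:
            return ("leaf", S)
        rule = choose_rule(S, rng)
        mask = np.array([bool(rule(x)) for x in S])
        if mask.all() or not mask.any():                      # degenerate split: stop early
            return ("leaf", S)
        return ("node", rule,
                make_tree(S[mask], choose_rule, min_size, rng),
                make_tree(S[~mask], choose_rule, min_size, rng))

    # Example use: make_tree(np.random.default_rng(0).standard_normal((500, 10)), rp_max_rule)

The only design choice beyond the pseudocode is the early stop on degenerate splits, which the paper does not specify; everything else follows the procedures above.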
The RPTree-Mean is similar to RPTree-Max, but differs in a critical respect: it occasionally performs a different kind of split, in which a cell is split into two pieces based on distance from the mean.

    procedure ChooseRule(S)
      comment: RPTree-Mean version
      if Δ²(S) ≤ c · Δ_A²(S)
        then { choose a random unit direction v
               Rule(x) := x · v ≤ median({z · v : z ∈ S}) }
        else { Rule(x) := ‖x − mean(S)‖ ≤ median{‖z − mean(S)‖ : z ∈ S} }
      return (Rule)

In the code, c is a constant, Δ(S) is the diameter of S (the distance between the two furthest points in the set), and Δ_A(S) is the average diameter, that is, the average distance between points of S:

    Δ_A²(S) = (1/|S|²) Σ_{x,y ∈ S} ‖x − y‖².

2.4 Main results
Suppose an RP tree is built from a data set S ⊂ R^D, not necessarily finite. If the tree has k levels, then it partitions the space into 2^k cells. We define the radius of a cell C ⊂ R^D to be the smallest r > 0 such that S ∩ C ⊂ B(x, r) for some x ∈ C. Our first theorem gives an upper bound on the rate at which the radius of cells in an RPTree-Max decreases as one moves down the tree.

  Theorem 3. There is a constant c1 with the following property. Suppose an RPTree-Max is built using data set S ⊂ R^D. Pick any cell C in the RP tree; suppose that S ∩ C has Assouad dimension d. Then with probability at least 1/2 (over the randomization in constructing the subtree rooted at C), for every descendant C′ which is more than c1 d log d levels below C, we have radius(C′) ≤ radius(C)/2.

Our next theorem gives a result for the second type of RPTree. In this case, we are able to quantify the improvement per level, rather than amortized over levels. Recall that an RPTree-Mean has two different types of splits; let's call them splits by distance and splits by projection.

  Theorem 4. There are constants 0 < c1, c2, c3 < 1 with the following property. Suppose an RPTree-Mean is built using data set S ⊂ R^D. Consider any cell C of radius r, such that S ∩ C has local covariance dimension (d, ε, r), where ε < c1. Pick a point x ∈ S ∩ C at random, and let C′ be the cell that contains it at the next level down.
    If C is split by distance, E[Δ(S ∩ C′)] ≤ c2 Δ(S ∩ C).
    If C is split by projection, then E[Δ_A²(S ∩ C′)] ≤ (1 − (c3/d)) Δ_A²(S ∩ C).
  In both cases, the expectation is over the randomization in splitting C and the choice of x ∈ S ∩ C.

2.5 A lower bound for k-d trees
Finally, we remark that this property of automatically adapting to intrinsic dimension does not hold for k-d trees. The counterexample is very simple, and applies to any variant of k-d trees that uses axis-aligned splits. Consider S ⊂ R^D made up of the coordinate axes between −1 and 1: S = ∪_{i=1}^{D} {t e_i : −1 ≤ t ≤ 1}. Here e_1, . . . , e_D is the canonical basis of R^D. There are many application domains, such as text, in which data is sparse; this example is an extreme case.
S lies within B(0, 1) and can be covered by 2D balls of radius 1/2. It is not hard to see that the Assouad dimension of S is d = log 2D. On the other hand, a k-d tree would clearly need D levels before halving the diameter of its cells. Thus k-d trees cannot be said to adapt to the intrinsic dimensionality of data.

2.6 Connections to other work

Uses of k-d trees
As described in the introduction, the use of k-d trees for classification, regression, near neighbor search, and vector quantization leads to rates of convergence that depend on the rate at which the diameter of cells decreases down the tree. Based on our results, RP trees might considerably extend the scope of these methods, from data that is low dimensional to data that is just intrinsically low dimensional. [12] contains experimental results in this direction.
A related problem is nearest neighbor search, for which k-d trees are commonly used. Here, the criterion governing the efficacy of search is harder to make precise. Interestingly, some state-of-the-art practical work on tree-based nearest neighbor search [21] uses random projection as a preprocessing step. Another notable use of random projections in this context is locality-sensitive hashing [14]. Also relevant is work on other tree structures with complexity guarantees for nearest neighbor search [1, 20, 5]. It would be interesting if similar guarantees could be shown for a data structure as simple as ours.

Vector quantization
Vector quantization [16] is a basic building block of lossy data compression. Here, random vectors X are generated from some distribution P over R^D, and the goal is to pick a finite codebook C ⊂ R^D and an encoding function α : R^D → C such that E_P ‖X − α(X)‖² is small.
Ideally we'd let α(x) be the nearest neighbor of x in C, but often (in audio or video compression) the number of codewords is so enormous that this nearest neighbor computation cannot be performed in real time. A more efficient scheme is to have the codewords arranged in a tree [7]: there is a partition of space like a k-d tree, and each x is mapped to the mean value in cell(x). Our Theorem 4 shows that the vector quantization error of RP trees behaves like e^{−O(r/d)}, where r is the depth of the tree and d is the intrinsic dimension. There is a substantial body of work that obtains rates for vector quantization, and as one may expect, these turn out to be of the form e^{−r/D} [15].

Compressed sensing
The field of compressed sensing has grown out of the surprising realization that high-dimensional sparse data can be accurately reconstructed from just a few random projections [6, 10]. The central premise of this research area is that the original data thus never even needs to be collected: all one ever sees are the random projections.
RP trees are similar in spirit and entirely compatible with this viewpoint. Theorem 4 holds even if the random projections are forced to be the same across each entire level of the tree. For a tree of depth k, this means only k random projections are ever needed, and these can be computed beforehand (the split-by-distance can be reworked to operate in the projected space rather than the high-dimensional space). The data are not accessed in any other way.
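For concreteness, here is a minimal Python/NumPy sketch of the RPTree-Mean ChooseRule above; it is ours, not the paper's code. The value of the constant c and the helper names (sq_diameter, avg_sq_diameter, rp_mean_rule) are assumptions of this sketch; the paper only requires c to be a suitable constant.

    import numpy as np

    def _pairwise_sq_dists(S):
        G = S @ S.T
        sq = np.diag(G)
        return np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0)

    def sq_diameter(S):
        # Delta^2(S): squared distance between the two furthest points.
        return _pairwise_sq_dists(S).max()

    def avg_sq_diameter(S):
        # Delta_A^2(S): average of ||x - y||^2 over all pairs x, y in S.
        return _pairwise_sq_dists(S).mean()

    def rp_mean_rule(S, c=10.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        n, D = S.shape
        if sq_diameter(S) <= c * avg_sq_diameter(S):
            v = rng.standard_normal(D); v /= np.linalg.norm(v)   # split by projection
            t = np.median(S @ v)
            return lambda x: x @ v <= t
        mu = S.mean(axis=0)                                      # split by distance from the mean
        t = np.median(np.linalg.norm(S - mu, axis=1))
        return lambda x: np.linalg.norm(x - mu) <= t

This rule can be plugged directly into a MakeTree routine like the one sketched at the end of Section 2.3.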
3. AN RPTREE-MAX ADAPTS TO ASSOUAD DIMENSION
In this section, we prove Theorem 3. A rough outline is as follows. Suppose an RP tree is built using data set S ⊂ R^D of Assouad dimension d, and that C is some cell of the tree. If S ∩ C lies in a ball of radius Δ, then we need to show that after O(d log d) further levels of splitting, each resulting cell is contained in a ball of radius Δ/2. To this end, we start by covering S ∩ C with balls B1, B2, . . . , BN of radius Δ/√d. The Assouad dimension tells us N = O(d^{d/2}) suffices. We'll show that if two balls Bi and Bj are more than a distance (Δ/2) − (Δ/√d) apart, then a single random projection (with jittered split) has a constant probability of cleanly separating them, in the sense that Bi and Bj will lie entirely on opposite sides of the split. There are at most N² such pairs i, j, so after O(d log d) projections every one of these pairs will have been split. Thus, O(d log d) levels below C in the tree, each cell will only contain points from balls Bi which are within distance (Δ/2) − (Δ/√d) of each other. Hence the radius of these cells will be ≤ Δ/2.
Returning to Bi and Bj, we say that a split is good if it completely separates them. There are also bad splits, in which the split point intersects both the balls. The remaining splits are neutral (Figure 2). Most of our proof consists in showing that good splits are more likely than bad ones.

Figure 2: Cell C of the RP tree is contained in a ball of radius Δ. Balls Bi and Bj are part of a cover of this cell, and have radius Δ/√d. For this pair of balls, there are three kinds of splits: good, bad, and neutral.

To lower-bound the probability of a good split, let B̃i and B̃j be the projections of Bi and Bj onto a random line. We show that with constant probability the following events occur: (1) B̃i and B̃j have a certain amount of space between them. (2) The median of the projected data lies very close to this space. (3) Picking a split point at random near the median will separate B̃i from B̃j.

3.1 Gross statistics of projected data
We choose random projections from R^D to R by picking U ~ N(0, (1/D) I_D) (multivariate Gaussian) and mapping x ↦ x · U. The key property of such a projection is that it approximately preserves the lengths of vectors, modulo a scaling factor of √D. This is summarized below.

  Lemma 5. Fix any x ∈ R^D. Pick a random vector U ~ N(0, (1/D) I_D). Then for any α, β > 0:
  (a) P[ |U · x| ≤ α ‖x‖/√D ] ≤ √(2/π) α; and
  (b) P[ |U · x| ≥ β ‖x‖/√D ] ≤ (2/β) e^{−β²/2}.

Recall that the split rule looks at a random projection of the data and then splits it approximately at the median. The perturbation added to the median depends on the diameter of the space.
Suppose S ⊂ R^D has Assouad dimension d. Let S̃ = S · U be its random projection into R. How does diam(S̃) compare to diam(S)? (Here diam(S) = sup_{x,y ∈ S} ‖x − y‖.) Clearly diam(S̃) ≤ ‖U‖ diam(S), but we would in fact expect it to be much smaller if d << D. In fact, diam(S̃) ≈ diam(S) · O(√(d/D)); the following is adapted from an argument due to [19].

  Lemma 6. Suppose set S ⊂ R^D is contained in a ball B(x0, Δ) and has Assouad dimension d. Let S̃ denote the random projection of S into R. Then for any 0 < δ < 1, with probability > 1 − δ over the choice of projection, S̃ lies in an interval of radius (4Δ/√D) √(2(d + ln(2/δ))) centered at x̃0.

Thus S projects to an interval in R of radius at most O(Δ √(d/D)). In fact, most of the projected points will be even closer together, in a central interval of size O(Δ/√D).

  Lemma 7. Suppose S ⊂ R^D lies within ball B(x0, Δ). Pick any 0 < δ, ε ≤ 1 such that δε ≤ 1/e². Let ν be any measure on S. Then with probability > 1 − δ over the choice of random projection onto R, all but an ε fraction of S̃ (in ν-measure) lies within distance (Δ/√D) √(2 ln(1/δε)) of x̃0.

It follows that the median of the projected points also lies in this central interval; take ν to be the uniform distribution over S and use ε = 1/2.

  Corollary 8. Under the hypotheses of Lemma 7, for any 0 < δ < 2/e², with probability at least 1 − δ over the choice of projection: |median(S̃) − x̃0| ≤ (Δ/√D) √(2 ln(2/δ)).

3.2 The probability of good and bad splits
We now get to the main lemma, which gives a lower bound on the probability of a good split (recall Figure 2).
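Before stating the lemma, the following is a small Monte Carlo illustration of the good/bad dichotomy (a hedged construction of ours, not an experiment from the paper): two tiny clusters of a low-dimensional data set play the role of Bi and Bj, and the frequency of clean separations versus splits that cut through both clusters is estimated under random jittered-median splits. All constants and names here are arbitrary choices for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, D = 3000, 8, 300
    S = rng.standard_normal((n, d)) @ rng.standard_normal((d, D))   # data of intrinsic dimension d in R^D
    far = int(np.argmax(np.linalg.norm(S - S[0], axis=1)))
    k = 25                                                          # the two "balls": k nearest neighbors
    B  = S[np.argsort(np.linalg.norm(S - S[0],   axis=1))[:k]]      #   of two far-apart data points
    Bp = S[np.argsort(np.linalg.norm(S - S[far], axis=1))[:k]]
    width = np.linalg.norm(S[0] - S[far])                           # proxy for the diameter ||x - y||
    good = bad = 0
    trials = 2000
    for _ in range(trials):
        U = rng.normal(0, np.sqrt(1.0 / D), size=D)
        split = np.median(S @ U) + rng.uniform(-1, 1) * 6 * width / np.sqrt(D)
        a, b = B @ U, Bp @ U
        if a.max() < split < b.min() or b.max() < split < a.min():
            good += 1                                               # clean separation of the two balls
        elif (a.min() < split < a.max()) and (b.min() < split < b.max()):
            bad += 1                                                # split point intersects both balls
        # everything else is a neutral split
    print("P[good split] ~", good / trials, "   P[bad split] ~", bad / trials)

On data like this, good splits are observed far more often than bad ones, which is the qualitative fact the next two lemmas make quantitative.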
  Lemma 9. Say S ⊂ B(x0, Δ) has Assouad dimension d ≥ 1. Pick balls B = B(z, r) and B′ = B(z′, r) such that: their centers z and z′ lie in B(x0, Δ); the distance between these centers is ‖z − z′‖ ≥ (1/2)Δ − r; and the radius r is at most Δ/(512√d). Now pick a random projection U, that sends S to (say) S̃ ⊂ R, and then pick a split point at random in the range median(S̃) ± (6Δ/√D). With probability at least 1/192 over the choice of U and the split point, S ∩ B and S ∩ B′ will be contained in separate halves of the split.

Proof. (Sketch.) Let B̃ and B̃′ be the projections of S ∩ B and S ∩ B′. It follows from Lemmas 5 and 6 and Corollary 8 that with probability at least 1/2, the random projection U will satisfy the following properties:
  1. B̃ and B̃′ are contained within intervals of radius at most Δ/(16√D) around z̃ and z̃′, respectively.
  2. |z̃ − z̃′| ≥ Δ/(4√D).
  3. z̃ and z̃′ both lie within distance 3Δ/√D of x̃0.
  4. The median of S̃ lies within distance 3Δ/√D of x̃0.
In this case, we say U is good, and the following picture of S̃ is valid: B̃ and B̃′ are disjoint intervals separated by a gap of width Ω(Δ/√D), and the whole configuration lies within distance O(Δ/√D) of median(S̃).
The sweet spot is the region between B̃ and B̃′; if the split point falls in it, the two balls will be cleanly separated. By properties (1) and (2), the length of this sweet spot is at least Δ/(4√D) − 2Δ/(16√D) = Δ/(8√D). Moreover, by (3) and (4), we know that its entirety must lie within distance 3Δ/√D of x̃0 (since both z̃ and z̃′ do), and thus within distance 6Δ/√D of median(S̃). Thus, under the sampling strategy from the lemma statement, there is a constant probability of hitting the sweet spot. Putting it all together,

    P[B, B′ cleanly separated] ≥ P[U is good] · P[B, B′ separated | U is good]
      ≥ (1/2) · (Δ/(8√D)) / (12Δ/√D) = 1/192,

as claimed.

We also upper bound the chance of a bad split (Figure 2). For the final qualitative result, all that matters is that this probability be strictly smaller than that of a good split.

  Lemma 10. Under the hypotheses of Lemma 9,

    P[B̃, B̃′ both intersect the split point] < 1/384.

3.3 Proof of Theorem 3
Finally, we complete the proof of Theorem 3.

  Lemma 11. Suppose S ⊂ R^D has Assouad dimension d. Pick any cell C in the RP tree; suppose it is contained in a ball of radius Δ. Then the probability that there exists a descendant of C which is more than Ω(d log d) levels below and yet has radius > Δ/2 is at most 1/2.

Proof. Suppose S ∩ C ⊂ B(x0, Δ). Cover this set by balls of radius r = Δ/(512√d); the Assouad dimension tells us that N = (O(d))^d balls suffice. Now, fix any pair of balls B, B′ from this cover whose centers are at distance at least Δ/2 − r from one another; and, for k = 1, 2, . . ., let p_k be the probability that there is some cell k levels below C which contains points from both B and B′.
By Lemma 9, p_1 ≤ 191/192. To express p_k in terms of p_{k−1}, think of the randomness in the subtree rooted at C as having two parts: the randomness in splitting cell C, and the rest of the randomness (for each of the two induced subtrees). Lemmas 9 and 10 then tell us that

    p_k ≤ P[top split cleanly separates B from B′] · 0
          + P[top split intersects both B and B′] · 2p_{k−1}
          + P[all other split configurations] · p_{k−1}
        ≤ (1/192) · 0 + (1/384) · 2p_{k−1} + (1 − 1/192 − 1/384) · p_{k−1}
        = (1 − 1/384) p_{k−1}.

The three cases in the first inequality correspond to good, bad, and neutral splits at C (Figure 2). It follows that for some constant c′ and k = c′ d log d, we have p_k ≤ 1/N². To finish up, take a union bound over all faraway pairs of balls from the cover.

4. AN RPTREE-MEAN ADAPTS TO LOCAL COVARIANCE DIMENSION
An RPTree-Mean has two types of splits. If a cell C has much larger diameter than average-diameter (that is, average interpoint distance), then it is split according to the distances of points from the mean. Otherwise, a random projection is used.

4.1 Splitting by distance from the mean
This is invoked when the points in the current cell, call them S, satisfy Δ²(S) > c Δ_A²(S) (recall that Δ(S) is the diameter of S while Δ_A²(S) is the average squared interpoint distance).

  Lemma 12. Suppose that Δ²(S) > c Δ_A²(S). Let S1 be the points in S whose distance to mean(S) is less than or equal to the median distance, and let S2 be the remaining points. Then the expected squared diameter after the split is (|S1|/|S|) Δ²(S1) + (|S2|/|S|) Δ²(S2) ≤ (1/2 + 2/c) Δ²(S).

4.2 Splitting by projection: proof outline
Suppose the current cell contains a set of points S ⊂ R^D with Δ²(S) ≤ c Δ_A²(S). We show that a split by projection has a constant probability of reducing the average squared diameter Δ_A²(S) by Ω(Δ_A²(S)/d). Our proof has three parts:
  I. Suppose S is split into S1 and S2, with means µ1 and µ2. Then the reduction in average diameter can be expressed in a remarkably simple form, as a multiple of ‖µ1 − µ2‖².
  II. Next, we give a lower bound on the distance between the projected means, (µ̃1 − µ̃2)². We show that the distribution of the projected points is subgaussian with variance O(Δ_A²(S)/D). This well-behavedness implies that (µ̃1 − µ̃2)² = Ω(Δ_A²(S)/D).
  III. We finish by showing that, approximately, ‖µ1 − µ2‖² is at least (D/d) (µ̃1 − µ̃2)². This is because µ1 − µ2 lies close to the subspace spanned by the top d eigenvectors of the covariance matrix of S; and with high probability, every vector in this subspace shrinks by O(√(d/D)) when projected on a random line.
We now tackle these three parts of the proof in order.

4.3 Quantifying the reduction in average diameter
The average squared diameter Δ_A²(S) has certain reformulations that make it convenient to work with. These properties are consequences of the following two observations, the first of which the reader may recognize as a standard bias-variance decomposition of statistics.

  Lemma 13. Let X, Y be independent and identically distributed random variables in R^n, and fix any z ∈ R^n.
  (a) E‖X − z‖² = E‖X − EX‖² + ‖z − EX‖².
  (b) E‖X − Y‖² = 2 E‖X − EX‖².

Proof. Part (a) is immediate when both sides are expanded. For (b), we use part (a) to assert that for any fixed y, we have E‖X − y‖² = E‖X − EX‖² + ‖y − EX‖². We then take expectation over Y = y.

This can be used to show that the average squared diameter, Δ_A²(S), is twice the average squared distance of points in S from their mean.

  Corollary 14. The average squared diameter of a set S can also be written as: Δ_A²(S) = (2/|S|) Σ_{x∈S} ‖x − mean(S)‖².

Proof. Δ_A²(S) is simply E‖X − Y‖², when X, Y are i.i.d. draws from the uniform distribution over S.

At each successive level of the tree, the current cell is split into two, either by a random projection or according to distance from the mean. Suppose the points in the current cell are S, and that they are split into sets S1 and S2. It is obvious that the expected diameter is nonincreasing:

    Δ(S) ≥ (|S1|/|S|) Δ(S1) + (|S2|/|S|) Δ(S2).

This is also true of the expected average diameter. In fact, we can precisely characterize how much it decreases on account of the split.

  Lemma 15. Suppose set S is partitioned (in any manner) into S1 and S2. Then

    Δ_A²(S) − [ (|S1|/|S|) Δ_A²(S1) + (|S2|/|S|) Δ_A²(S2) ]
      = (2|S1||S2|/|S|²) ‖mean(S1) − mean(S2)‖².

This completes part I of the proof outline.

4.4 Properties of the projected data
Projection from R^D into R^1 shrinks the average squared diameter of a data set by roughly D. To see this, we start with the fact that when a data set with covariance A is projected onto a vector U, the projected data have variance U^T A U. We then observe that for random U, such quadratic forms are concentrated about their expected values.

  Lemma 16. Pick U ~ N(0, (1/D) I_D). For any S ⊂ R^D, with probability at least 1/10, the projection of S onto U has average squared diameter Δ_A²(S · U) ≥ Δ_A²(S)/(4D).

Proof. By Corollary 14,

    Δ_A²(S · U) = (2/|S|) Σ_{x∈S} ((x − mean(S)) · U)² = 2 U^T cov(S) U,

where cov(S) is the covariance of data set S. This quadratic term has expectation (over the choice of U) E[2 U^T cov(S) U] = 2 Σ_{i,j} E[U_i U_j] cov(S)_{ij} = (2/D) Σ_i cov(S)_{ii} = Δ_A²(S)/D.
Lemma 23(a) then bounds the concentration of U^T cov(S) U around its expected value.

Next, we examine the overall distribution of the projected points. When S ⊂ R^D has diameter Δ, its projection into the line can have diameter up to Δ, but as we saw in Lemma 7, most of it will lie within a central interval of size O(Δ/√D). Now we characterize the distribution more precisely.

  Lemma 17. Suppose S ⊂ B(0, Δ) ⊂ R^D. Pick any δ > 0 and U ~ N(0, (1/D) I_D). With probability 1 − δ, S · U = {x · U : x ∈ S} satisfies the following property for all positive integers k: the fraction of points outside the interval (−kΔ/√D, +kΔ/√D) is at most (2^k/δ) e^{−k²/2}.

Proof. Apply Lemma 7 for each k (with failure probability δ/2^k) and take a union bound.

Finally, we examine what happens when a d-dimensional linear subspace of R^D is projected into R^1. We show a uniform bound over all vectors in the subspace.

  Lemma 18. There exists a constant κ > 0 with the following property. Fix any δ > 0 and any d-dimensional subspace H ⊂ R^D. Pick U ~ N(0, (1/D) I_D). Then with probability at least 1 − δ over the choice of U,

    sup_{x∈H} |x · U|²/‖x‖² ≤ κ (d + ln(1/δ))/D.

Proof. Apply Lemma 6 to the intersection of H with the surface of the unit sphere in R^D. This set has Assouad dimension O(d).

4.5 Distance between projected means
We are dealing with the case when Δ²(S) ≤ c Δ_A²(S), that is, the diameter of set S is at most a constant factor times the average interpoint distance. If S is projected onto a random direction, the projected points will have variance about Δ_A²(S)/D, by Lemma 16; and by Lemma 17, it isn't too far from the truth to think of these points as having roughly a Gaussian distribution. Thus, if the projected points are split into two groups at the mean, we would expect the means of these two groups to be separated by a distance of about Δ_A(S)/√D. Indeed, this is the case. The same holds if we split at the median, which isn't all that different from the mean for close-to-Gaussian distributions.
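Before making this precise (Lemma 19 below), here is a quick numerical check of the heuristic just described; it is a sketch of ours under arbitrary parameter choices, not an experiment from the paper. It projects a data set onto random directions, splits at the median, and compares the squared gap between the projected means with Δ_A²(S)/D.

    import numpy as np

    rng = np.random.default_rng(0)
    n, D = 2000, 500
    S = rng.standard_normal((n, D))
    # Corollary 14: Delta_A^2(S) is twice the mean squared distance to the mean.
    delta_A_sq = 2 * ((S - S.mean(axis=0)) ** 2).sum(axis=1).mean()
    gaps = []
    for _ in range(200):
        U = rng.normal(0, np.sqrt(1.0 / D), size=D)
        proj = S @ U
        s = np.median(proj)
        gaps.append(proj[proj >= s].mean() - proj[proj < s].mean())
    print("typical (mu2~ - mu1~)^2 :", np.mean(np.square(gaps)))
    print("Delta_A^2(S) / D        :", delta_A_sq / D)

The two printed quantities are comparable, in line with the Ω(Δ_A²(S)/D) separation asserted in part II of the outline.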
  Lemma 19. There is a constant κ2 for which the following holds. Pick any 0 < δ < 1/(16c). Pick U ~ N(0, (1/D) I_D) and split S into two pieces:

    S1 = {x ∈ S : x · U < s}  and  S2 = {x ∈ S : x · U ≥ s},

where s is either mean(S · U) or median(S · U). Write p = |S1|/|S|, and let µ̃1 and µ̃2 denote the means of S1 · U and S2 · U, respectively. Then with probability at least 1/10 − δ,

    (µ̃2 − µ̃1)² ≥ (κ2 / (p(1−p))²) · (Δ_A²(S)/D) · (1 / (c log(1/δ))).

Proof. Let the r.v. X̃ be a uniform-random draw from the projected points S · U. Without loss of generality S̃ has mean 0, so EX̃ = 0 and thus p µ̃1 + (1−p) µ̃2 = 0. Rearranging, µ̃1 = −(1−p)(µ̃2 − µ̃1) and µ̃2 = p(µ̃2 − µ̃1).
We already know from Lemma 16 (and Corollary 14) that with probability at least 1/10, the variance of the projected points is significant: var(X̃) ≥ Δ_A²(S)/(8D). We'll show this implies a similar lower bound on (µ̃2 − µ̃1)².
Using 1(·) to denote 0-1 indicator variables, for any t > 0,

    var(X̃) ≤ E[(X̃ − s)²] ≤ E[ 2t|X̃ − s| + (|X̃ − s| − t)² 1(|X̃ − s| ≥ t) ].

This is convenient since the linear term gives us µ̃2 − µ̃1:

    E[2t|X̃ − s|] = 2t( p(s − µ̃1) + (1−p)(µ̃2 − s) )
                 = 4t p(1−p) (µ̃2 − µ̃1) + 2ts(2p − 1).

The last term vanishes since the split is either at the mean of the projected points, in which case s = 0, or at the median, in which case p = 1/2.
Next, we'll choose t = t0 (Δ(S)/√D) √(log(1/δ)) for some suitable constant t0, so that the quadratic term in var(X̃) can be bounded using Lemma 17 and Corollary 8: with probability at least 1 − δ, E[(|X̃| − t)² 1(|X̃| ≥ t)] ≤ δ Δ²(S)/D (a simple integration). Putting things together,

    Δ_A²(S)/(8D) ≤ var(X̃) ≤ 4t p(1−p)(µ̃2 − µ̃1) + δ Δ²(S)/D.

The result now follows immediately by algebraic manipulation, using the relation Δ²(S) ≤ c Δ_A²(S).

4.6 Distance between high-dimensional means
Split S into two pieces as in the setting of Lemma 19, and let µ1 and µ2 denote the means of S1 and S2, respectively. We already have a lower bound on the distance between the projected means, µ̃2 − µ̃1; we will now show that ‖µ2 − µ1‖ is larger than this by a factor of about √(D/d). The main technical difficulty here is the dependence between the µ_i and the projection U. Incidentally, this is the only part of the entire argument that exploits intrinsic dimensionality.

  Lemma 20. There is a constant κ3 with the following property. Suppose set S ⊂ R^D is such that the top d eigenvalues of cov(S) account for more than 1 − ε of its trace. Pick a random vector U ~ N(0, (1/D) I_D), and split S into two pieces, S1 and S2, in any fashion (which may depend upon U). Let p = |S1|/|S|. Let µ1 and µ2 be the means of S1 and S2, and µ̃1 and µ̃2 the means of S1 · U and S2 · U. Then for any δ > 0, with probability 1 − δ over the choice of U,

    ‖µ2 − µ1‖² ≥ (κ3 D / (d + ln(1/δ))) · [ (µ̃2 − µ̃1)² − (4ε/(δ p(1−p))) · (Δ_A²(S)/D) ].

Proof. Assume without loss of generality that S has zero mean. Let H be the subspace spanned by the top d eigenvectors of cov(S), and let H⊥ be its orthogonal subspace. Write any point x ∈ R^D as x_H + x_⊥, where each component is a vector in R^D that lies in the respective subspace.
Pick the random vector U; with probability ≥ 1 − δ it satisfies the following two properties.
Property 1: For some constant κ > 0, for every x ∈ R^D,

    |x_H · U|² ≤ κ ((d + ln(1/δ))/D) ‖x_H‖² ≤ κ ((d + ln(1/δ))/D) ‖x‖².

This holds (with probability 1 − δ/2) by Lemma 18.
Property 2: Letting X be a uniform-random draw from S,

    E_X[(X_⊥ · U)²] ≤ (2/δ) E_U E_X[(X_⊥ · U)²] = (2/δ) E_X E_U[(X_⊥ · U)²]
                   = (2/(δD)) E_X[‖X_⊥‖²] ≤ (ε/δ) · (Δ_A²(S)/D).

The first step is Markov's inequality, and holds with probability 1 − δ/2. The last inequality comes from the local covariance condition.
So assume the two properties hold. Writing µ2 − µ1 as (µ2H − µ1H) + (µ2⊥ − µ1⊥),

    (µ̃2 − µ̃1)² = ((µ2H − µ1H) · U + (µ2⊥ − µ1⊥) · U)²
               ≤ 2((µ2H − µ1H) · U)² + 2((µ2⊥ − µ1⊥) · U)².

The first term can be bounded by Property 1:

    ((µ2H − µ1H) · U)² ≤ κ ((d + ln(1/δ))/D) ‖µ2 − µ1‖².

For the second term, let E_X denote expectation over X chosen uniformly at random from S. Then

    ((µ2⊥ − µ1⊥) · U)² ≤ 2(µ2⊥ · U)² + 2(µ1⊥ · U)²
      = 2(E_X[X_⊥ · U | X ∈ S2])² + 2(E_X[X_⊥ · U | X ∈ S1])²
      ≤ 2 E_X[(X_⊥ · U)² | X ∈ S2] + 2 E_X[(X_⊥ · U)² | X ∈ S1]
      ≤ (2/(1−p)) E_X[(X_⊥ · U)²] + (2/p) E_X[(X_⊥ · U)²]
      = (2/(p(1−p))) E_X[(X_⊥ · U)²] ≤ (2ε/(p(1−p)δ)) · (Δ_A²(S)/D)

by Property 2. The lemma follows by putting the various pieces together.

We can now finish off the proof of Theorem 4.

  Theorem 21. Fix any ε ≤ O(1/c). Suppose set S ⊂ R^D has the property that the top d eigenvalues of cov(S) account for more than 1 − ε of its trace. Pick a random vector U ~ N(0, (1/D) I_D) and split S into two parts,

    S1 = {x ∈ S : x · U < s}  and  S2 = {x ∈ S : x · U ≥ s},

where s is either mean(S · U) or median(S · U). Then with probability Ω(1), the expected average diameter shrinks by Ω(Δ_A²(S)/(cd)).

Proof. By Lemma 15, the reduction in expected average diameter is 2p(1−p)‖µ1 − µ2‖², in the language of Lemmas 19 and 20. The rest follows from those lemmas.
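As an end-to-end illustration of the guarantee just proved, the following hedged sketch (ours, not the authors' experiment) generates data that is approximately d-dimensional inside R^D, performs random-projection splits at the median, and compares the relative drop in average squared diameter with 1/d. All constants are arbitrary choices for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, D = 4000, 8, 400
    X = rng.standard_normal((n, d)) @ rng.standard_normal((d, D))
    X += 0.01 * rng.standard_normal((n, D))              # small off-subspace noise

    def avg_sq_diam(S):                                   # Corollary 14
        return 2 * ((S - S.mean(axis=0)) ** 2).sum(axis=1).mean()

    before = avg_sq_diam(X)
    drops = []
    for _ in range(100):
        U = rng.normal(0, np.sqrt(1.0 / D), size=D)
        proj = X @ U
        s = np.median(proj)
        S1, S2 = X[proj < s], X[proj >= s]
        after = (len(S1) * avg_sq_diam(S1) + len(S2) * avg_sq_diam(S2)) / n
        drops.append(1 - after / before)
    print("mean relative drop per split:", np.mean(drops), "  vs  1/d =", 1.0 / d)

The observed per-split drop is on the order of 1/d and does not depend on the ambient dimension D, which is the behavior Theorem 21 predicts for data of low local covariance dimension.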
Acknowledgements
Dasgupta acknowledges the support of the National Science
Foundation under grants IIS-0347646 and IIS-0713540.

5. REFERENCES
[1] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891–923, 1998.
[2] P. Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429–448, 1983.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[4] J. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In 23rd International Conference on Machine Learning, 2006.
[6] E. Candès and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
[7] P. Chou, T. Lookabaugh, and R. Gray. Optimal pruning with applications to tree-structured source coding and modeling. IEEE Transactions on Information Theory, 35(2):299–315, 1989.
[8] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
[9] M. do Carmo. Riemannian Geometry. Birkhäuser, 1992.
[10] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[11] R. Durrett. Probability: Theory and Examples. Duxbury, second edition, 1995.
[12] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the structure of manifolds using random projections. In Neural Information Processing Systems, 2007.
[13] H. Fuchs, Z. Kedem, and B. Naylor. On visible surface generation by a priori tree structures. Computer Graphics, 14(3):124–133, 1980.
[14] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In 25th International Conference on Very Large Databases, 1999.
[15] S. Graf and H. Luschgy. Foundations of Quantization for Probability Distributions. Springer, 2000.
[16] R. Gray and D. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.
[17] A. Gupta, R. Krauthgamer, and J. Lee. Bounded geometries, fractals, and low-distortion embeddings. In IEEE Symposium on Foundations of Computer Science, pages 534–544, 2003.
[18] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, 2001.
[19] P. Indyk and A. Naor. Nearest neighbor preserving embeddings. ACM Transactions on Algorithms, 3(3), 2007.
[20] R. Krauthgamer and J. Lee. Navigating nets: simple algorithms for proximity search. In ACM-SIAM Symposium on Discrete Algorithms, 2004.
[21] T. Liu, A. Moore, A. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In Neural Information Processing Systems, 2004.
[22] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete and Computational Geometry, 2006.
[23] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[24] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

APPENDIX

A. ASSOUAD DIMENSION OF A SMOOTH MANIFOLD
If M is a d-dimensional Riemannian submanifold of R^D, what is its Assouad dimension? An easy case is when M is an affine set, in which case it has the same Assouad dimension as R^d, namely O(d). We may expect that for more general M, the same holds true of small enough neighborhoods.
Recall that we define balls with respect to Euclidean distance in R^D rather than geodesic distance on M. If a neighborhood M ∩ B(x, r) has high curvature (speaking informally), then it could potentially have large Assouad dimension. For instance, it could be a 1-dimensional manifold and yet curve so much that Ω(2^D) balls of radius r/2 are needed to cover it (Figure 3). We therefore limit attention to manifolds of bounded curvature, and to values of r small enough that the pieces of M in B(x, r) are relatively flat.

Figure 3: Hilbert's space filling curve: a 1D manifold that has Assouad dimension 2, when the radius of balls is larger than the curvature of the manifold.

To formalize things, we need a handle on how curved the manifold M is locally. This is a relationship between the Riemannian metric on M and that of the space R^D in which it is immersed, and is captured by the second fundamental form (chapter 6 of [9]). For any point p ∈ M, this is a symmetric bilinear form B : Tp × Tp → Tp⊥, where Tp denotes the tangent space at p and Tp⊥ the normal space orthogonal to Tp. Our assumption on curvature is the following.
  Assumption. The norm of the second fundamental form is uniformly bounded by some κ ≥ 0; that is, for all p ∈ M and unit norm η ∈ Tp⊥ and u ∈ Tp, we have ⟨η, B(u, u)⟩/⟨u, u⟩ ≤ κ.

We will henceforth limit attention to balls of radius O(1/κ).
An additional minor effect is that M ∩ B(x, r) may consist of several connected components, in which case we need to cover each of them. If there are N components, this can add a factor of log N to the Assouad dimension, making it O(d + log N).
Almost all the technical details needed to bound the Assouad dimension of manifolds appear in a separate context in [22]. Here we just put them together differently.

  Theorem 22. Suppose M is a d-dimensional Riemannian submanifold of R^D that satisfies the assumption above for some κ ≥ 0. For any x ∈ R^D and 0 < r ≤ 1/(2κ), the set M ∩ B(x, r) can be covered by N · 2^{O(d)} balls of radius r/2, where N is the number of connected components of M ∩ B(x, r).

Proof. We'll show that each connected component of M ∩ B(x, r) can be covered by 2^{O(d)} balls of radius r/2. To this end, fix one such component, and denote its restriction to B(x, r) by M′.
Pick p ∈ M′, and let Tp be the tangent space at p. Now consider the projection of M′ onto Tp; let f denote this projection map. We make use of two facts proved in [22].

  Fact 1 (Lemma 5.4 of [22]). The projection map f : M′ → Tp is 1-1.

Now, f(M′) is contained in a d-dimensional ball of radius 2r and can therefore be covered by 2^{O(d)} balls of radius r/4. We are almost done, as long as we can show that for any such ball B ⊂ Tp, the inverse image f^{−1}(B) is contained in a D-dimensional ball of radius r/2. This follows from

  Fact 2 (implicit in proof of Lemma 5.3 of [22]). For any x, y ∈ M′, ‖f(x) − f(y)‖² ≥ ‖x − y‖² (1 − r²κ²).

Thus the inverse image of the cover in the tangent space yields a cover of M′.

B. VARIOUS PROOFS

B.1 Proof of Lemma 5
Since U has a Gaussian distribution, and any linear combination of independent Gaussians is a Gaussian, it follows that the projection U · x is also Gaussian. Its mean and variance are easily seen to be zero and ‖x‖²/D, respectively. Therefore, writing Z = (√D/‖x‖)(U · x), we have that Z ~ N(0, 1). The bounds stated in the lemma now follow from properties of the standard normal. In particular, N(0, 1) is roughly flat in the range [−1, 1] and then drops off rapidly; the two cases in the lemma statement correspond to these two regimes.
The highest density achieved by the standard normal is 1/√(2π). Thus the probability mass it assigns to the interval [−α, α] is at most 2α/√(2π); this takes care of (a). For (b), we use a standard tail bound for the normal, P(|Z| ≥ β) ≤ (2/β) e^{−β²/2}; see, for instance, page 7 of [11].

Figure 4: A hierarchy of covers. At level k, there are 2^{kd} points in the cover. Each of them has distance Δ/2^k to its children (which constitute the cover at level k + 1). At the leaves are individual points of S.

B.2 Proof of Lemma 6
Pick a cover of S ∩ B(x0, Δ) by 2^d balls of radius Δ/2. Without loss of generality, we can assume the centers of these balls lie in B(x0, Δ). Each such ball B induces a subset S ∩ B; cover each such subset by 2^d smaller balls of radius Δ/4, once again with centers in B. Continuing this process, the final result is a hierarchy of covers, at increasingly finer granularities (Figure 4).
Pick any center u at level k of the tree, along with one of its children v at level k + 1. Then ‖u − v‖ ≤ Δ/2^k. Letting ũ, ṽ denote the projections of these two points, we have from Lemma 5(b) that

    P[ |ũ − ṽ| ≥ (βΔ/√D)(3/4)^k ] ≤ P[ |ũ − ṽ| ≥ β(3/2)^k ‖u − v‖/√D ]
      ≤ (2/(β(3/2)^k)) exp(−(β²/2)(3/2)^{2k}) ≤ (2/3)^k (δ/2)^{k+1} e^{−(k+1)d},

using β = √(2(d + ln(2/δ))), and (3/2)^{2k} ≥ k + 1 (for all k ≥ 0). Now take a union bound over all edges (u, v) in the tree. There are 2^{(k+1)d} edges between levels k and k + 1, so

    P[ ∃k, ∃u in level k with child v : |ũ − ṽ| ≥ (βΔ/√D)(3/4)^k ]
      ≤ Σ_{k=0}^{∞} 2^{(k+1)d} (2/3)^k (δ/2)^{k+1} e^{−(k+1)d} ≤ (δ/2) · 1/(1 − (δ/3)) ≤ δ,

where for the last step we observe 2^d ≤ e^d whenever d ≥ 1.
So with probability at least 1 − δ, for all k, every edge between levels k and k + 1 in the tree has projected length at most (βΔ/√D)(3/4)^k. Thus every projected point in S̃ has a distance from x̃0 of at most (βΔ/√D)(1 + 3/4 + (3/4)² + · · ·) = 4βΔ/√D. Plugging in the value of β then yields the lemma.

B.3 Proof of Lemma 7
Set c = √(2 ln(1/δε)) ≥ 2.
Fix any point x, and randomly choose a projection U. What is the chance that x̃ lands far from x̃0? Define the bad event to be Fx = 1(|x̃ − x̃0| ≥ cΔ/√D). By Lemma 5(b),

    E_U[Fx] ≤ P_U[ |x̃ − x̃0| ≥ c ‖x − x0‖/√D ] ≤ (2/c) e^{−c²/2} ≤ εδ.

Since this holds for any x ∈ S, it also holds in expectation over x drawn from ν. We are interested in bounding the
probability (over the choice of U) that more than an ε fraction of ν falls far from x̃0. Using Markov's inequality and then Fubini's theorem, we have

    P_U[ E_ν[Fx] ≥ ε ] ≤ E_U[E_ν[Fx]]/ε = E_ν[E_U[Fx]]/ε ≤ δ.

B.4 Proof of Lemma 9
It will help to define the failure probabilities δ1 = 2/e^{31} and δ2 = 1/20. In the proof sketch above, we defined four properties that make a projection U good. We now verify that they all hold with probability at least 1/2.
Property (1) follows by applying Lemma 6 to each ball in turn. For B, we have that with probability at least 1 − δ1, B̃ is within radius (4r/√D) √(2(d + ln(2/δ1))) ≤ (Δ/(128√D)) √(2 ln(2e/δ1)) = Δ/(16√D) of z̃. Similarly with B′, so this property holds with probability at least 1 − 2δ1.
(2) follows from Lemma 5(a); specifically, it fails with probability at most √(2/π) α ≤ (4/5) α for α = 1/(2 − (4r/Δ)) ≤ 128/255. Property (3) is from Lemma 5(b), with probability at least 1 − 2δ2/√(2 ln(2/δ2)) (in that lemma, use β = √(2 ln(2/δ2))). Finally, (4) holds with probability 1 − δ2 by Corollary 8.

B.5 Proof of Lemma 10
Define δ1 as in the previous proof. As before (property (1)), with probability at least 1 − 2δ1, the projections B̃ and B̃′ lie within radii Δ/(16√D) of their respective z̃, z̃′.
In order for B̃ and B̃′ to both intersect the split point, two unlikely events need to occur: first, B̃ must intersect B̃′; second, the split point must intersect B̃. These are independent events (one involves the projection and the other involves the split point), so we will bound them in turn.

    P[B̃ intersects B̃′] ≤ P[ |z̃ − z̃′| ≤ Δ/(8√D) ]
      ≤ √(2/π) · (Δ/(8√D)) / ((Δ/2 − r)(1/√D)) ≤ √(2/π) · (64/255)

by Lemma 5(a) and the conditions on r.

    P[split point intersects B̃] ≤ (Δ/(8√D)) / (12Δ/√D) = 1/96.

So the probability that B̃, B̃′ both intersect the split point is at most 2δ1 + P[B̃, B̃′ touch] · P[split point touches B̃] < 1/384.

B.6 Proof of Lemma 12
Let random variable X be distributed uniformly over S. Then P[‖X − EX‖² ≥ median(‖X − EX‖²)] ≥ 1/2 by definition of median, so E‖X − EX‖² ≥ median(‖X − EX‖²)/2. It follows from Corollary 14 that

    median(‖X − EX‖²) ≤ 2 E‖X − EX‖² = Δ_A²(S).

S1 has squared diameter Δ²(S1) ≤ (2 median(‖X − EX‖))² ≤ 4Δ_A²(S). Meanwhile, S2 has squared diameter at most Δ²(S). Therefore,

    (|S1|/|S|) Δ²(S1) + (|S2|/|S|) Δ²(S2) ≤ (1/2) · 4Δ_A²(S) + (1/2) · Δ²(S)

and the lemma follows by using Δ²(S) > c Δ_A²(S).

B.7 Proof of Lemma 15
Let µ, µ1, µ2 denote the means of S, S1, and S2. Using Corollary 14 and Lemma 13(a), we have

    Δ_A²(S) − (|S1|/|S|) Δ_A²(S1) − (|S2|/|S|) Δ_A²(S2)
      = (2/|S|) Σ_{x∈S} ‖x − µ‖² − (|S1|/|S|)(2/|S1|) Σ_{x∈S1} ‖x − µ1‖² − (|S2|/|S|)(2/|S2|) Σ_{x∈S2} ‖x − µ2‖²
      = (2/|S|) { Σ_{x∈S1} (‖x − µ‖² − ‖x − µ1‖²) + Σ_{x∈S2} (‖x − µ‖² − ‖x − µ2‖²) }
      = (2|S1|/|S|) ‖µ1 − µ‖² + (2|S2|/|S|) ‖µ2 − µ‖².

Writing µ as a weighted average of µ1 and µ2 then completes the proof.

B.8 Concentration of quadratic forms

  Lemma 23. Suppose A is an n × n positive semidefinite matrix, and U ~ N(0, (1/n) I_n). Then for any α, β > 0:
  (a) P[U^T A U < α E[U^T A U]] ≤ e^{−((1/2)−α)/2}, and
  (b) P[U^T A U > β E[U^T A U]] ≤ e^{−(β−2)/4}.

Proof. This follows by examining the moment-generating function of U^T A U. Since the distribution of U is spherically symmetric, we can work in the eigenbasis of A and assume without loss of generality that A = diag(a1, . . . , an), where a1, . . . , an are the eigenvalues. For convenience we take Σ_i a_i = 1.
Let U1, . . . , Un denote the individual coordinates of U. We can rewrite them as U_i = Z_i/√n, where Z1, . . . , Zn are i.i.d. standard normal random variables. Thus U^T A U = Σ_i a_i U_i² = (1/n) Σ_i a_i Z_i², and E[U^T A U] = 1/n.
We use Chernoff's bounding method for both parts. For (a), for any t > 0,

    P[U^T A U < α E[U^T A U]] = P[ Σ_i a_i Z_i² < α ]
      = P[ e^{−t Σ_i a_i Z_i²} > e^{−tα} ] ≤ E[e^{−t Σ_i a_i Z_i²}] / e^{−tα}
      = e^{tα} Π_i E[e^{−t a_i Z_i²}] = e^{tα} Π_i (1 + 2 t a_i)^{−1/2}

and the rest follows by using t = 1/2 along with 1/(1 + x) ≤ e^{−x/2} for 0 < x ≤ 1. Similarly for (b), for 0 < t < 1/2,

    P[U^T A U > β E[U^T A U]] = P[ Σ_i a_i Z_i² > β ]
      = P[ e^{t Σ_i a_i Z_i²} > e^{tβ} ] ≤ E[e^{t Σ_i a_i Z_i²}] / e^{tβ}
      = e^{−tβ} Π_i E[e^{t a_i Z_i²}] = e^{−tβ} Π_i (1 − 2 t a_i)^{−1/2}

and it is adequate to choose t = 1/4 and invoke 1/(1 − x) ≤ e^{2x} for 0 < x ≤ 1/2.
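For completeness, the following hedged sketch (ours, with arbitrary parameter choices) numerically checks the concentration statement of Lemma 23 for a random positive semidefinite diagonal matrix with unit trace.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 200, 50_000
    a = rng.random(n); a /= a.sum()          # eigenvalues of A, summing to 1
    Z = rng.standard_normal((trials, n))
    q = (a * Z**2).sum(axis=1) / n           # U^T A U, with U_i = Z_i / sqrt(n)
    expected = 1.0 / n                       # E[U^T A U] = trace(A)/n = 1/n
    alpha, beta = 0.2, 4.0
    print("P[q < alpha*E] :", (q < alpha * expected).mean(),
          "  bound:", np.exp(-(0.5 - alpha) / 2))
    print("P[q > beta*E]  :", (q > beta * expected).mean(),
          "  bound:", np.exp(-(beta - 2) / 4))

The empirical tail frequencies fall comfortably below the stated bounds, as expected.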
