
Spectral Clustering based on the graph p-Laplacian

Thomas Bühler tb@cs.uni-sb.de


Matthias Hein hein@cs.uni-sb.de
Saarland University, Computer Science Department, Campus E1 1, 66123 Saarbrücken, Germany

Abstract

We present a generalized version of spectral clustering using the graph p-Laplacian, a nonlinear generalization of the standard graph Laplacian. We show that the second eigenvector of the graph p-Laplacian interpolates between a relaxation of the normalized cut and the Cheeger cut. Moreover, we prove that in the limit as p → 1 the cut found by thresholding the second eigenvector of the graph p-Laplacian converges to the optimal Cheeger cut. Furthermore, we provide an efficient numerical scheme to compute the second eigenvector of the graph p-Laplacian. The experiments show that the clustering found by p-spectral clustering is at least as good as standard spectral clustering, but often leads to significantly better results.
1. Introduction

In recent years, spectral clustering has become one of the major clustering methods. The reasons are its generality, its efficiency and its rich theoretical foundation. Spectral clustering can be applied to any kind of data with a suitable similarity measure, and the clustering can be computed for millions of points. The theoretical background includes motivations based on balanced graph cuts, random walks and perturbation theory. We refer to (von Luxburg, 2007) and references therein for a detailed introduction to various aspects of spectral clustering.

In this paper our focus lies on the motivation of spectral clustering as a relaxation of balanced graph cut criteria. It is well known that the second eigenvectors of the unnormalized and normalized graph Laplacians correspond to relaxations of the ratio cut (Hagen & Kahng, 1991) and the normalized cut (Shi & Malik, 2000). There are also relaxations of balanced graph cut criteria based on semi-definite programming (De Bie & Cristianini, 2006), which turn out to be better than the standard spectral ones but are computationally more expensive.

In this paper we establish a connection between the Cheeger cut and the second eigenvector of the graph p-Laplacian, a nonlinear generalization of the graph Laplacian. A p-Laplacian which differs slightly from the one used in this paper has been used for semi-supervised learning by Zhou and Schölkopf (2005). Our main motivation for the use of eigenvectors of the graph p-Laplacian was the generalized isoperimetric inequality of Amghibech (2003), which relates the second eigenvalue of the graph p-Laplacian to the optimal Cheeger cut. The isoperimetric inequality becomes tight as p → 1, so that the second eigenvalue converges to the optimal Cheeger cut value. In this article we extend the isoperimetric inequality of Amghibech to the unnormalized graph p-Laplacian. However, our key result is to show that the cut obtained by thresholding the second eigenvector of the p-Laplacian converges to the optimal Cheeger cut as p → 1, which provides theoretical evidence that p-spectral clustering is superior to the standard case. Moreover, we provide an efficient algorithmic scheme for the (approximate) computation of the second eigenvector of the p-Laplacian and the resulting clustering. This allows us to do p-spectral clustering also for large scale problems. Our experimental results show that as one varies p from 2 (standard spectral clustering) towards 1, the value of the Cheeger cut obtained by thresholding the second eigenvector of the graph p-Laplacian is always decreasing.

In Section 2, we review balanced graph cut criteria. In Section 3, we introduce the graph p-Laplacian, followed by the definition of eigenvectors of nonlinear operators. In Section 4, we provide the theoretical key result relating the cut found by thresholding the second eigenvector of the graph p-Laplacian to the optimal Cheeger cut. The algorithmic scheme is presented in Section 5,
and extensive experiments on various datasets, including large scale ones, are given in Section 6.

2. Balanced graph cut criteria

Given a set of points in a feature space and a similarity measure, the data can be transformed into a weighted, undirected graph G, where the vertices V represent the points in the feature space and the positive edge weights W encode the similarity of pairs of points. A clustering of the points is then equivalent to a partition of V into subsets C_1, ..., C_k (which will be called clusters in the following). The usual objective for such a partitioning is to have high within-cluster similarity and low inter-cluster similarity. Additionally, the clusters should be balanced in the sense that the "size" of the clusters should not differ too much. All the graph cut criteria presented in this section implement these objectives with slightly different emphasis on the individual properties.

Before the definition of the balanced graph cut criteria, we have to introduce some notation. The number of points is denoted by n = |V|, and the complement of a set A ⊂ V is written as C̄ = V \ A, denoted Ā. The degree function d : V → R of the graph is given as $d_i = \sum_{j=1}^{n} w_{ij}$, and the cut of A ⊂ V and Ā is defined as

$$\mathrm{cut}(A, \overline{A}) = \sum_{i \in A,\ j \in \overline{A}} w_{ij}.$$
Moreover, we denote by |A| the cardinality of the set A and by $\mathrm{vol}(A) = \sum_{i \in A} d_i$ the volume of A. In the balanced graph cut criteria one either tries to balance the cardinality or the volume of the clusters.

The ratio cut RCut(C, C̄) (Hagen & Kahng, 1991) and the normalized cut NCut(C, C̄) (Shi & Malik, 2000) for a partition of V into C, C̄ are defined as

$$\mathrm{RCut}(C, \overline{C}) = \frac{\mathrm{cut}(C, \overline{C})}{|C|} + \frac{\mathrm{cut}(C, \overline{C})}{|\overline{C}|}, \qquad \mathrm{NCut}(C, \overline{C}) = \frac{\mathrm{cut}(C, \overline{C})}{\mathrm{vol}(C)} + \frac{\mathrm{cut}(C, \overline{C})}{\mathrm{vol}(\overline{C})}.$$

A slightly different balancing behavior is induced by the corresponding ratio Cheeger cut RCC(C, C̄) and normalized Cheeger cut NCC(C, C̄), defined as

$$\mathrm{RCC}(C, \overline{C}) = \frac{\mathrm{cut}(C, \overline{C})}{\min\{|C|,\, |\overline{C}|\}}, \qquad \mathrm{NCC}(C, \overline{C}) = \frac{\mathrm{cut}(C, \overline{C})}{\min\{\mathrm{vol}(C),\, \mathrm{vol}(\overline{C})\}}.$$

One has the following simple relation between the normalized cut NCut(C, C̄) and the normalized Cheeger cut NCC(C, C̄):

$$\mathrm{NCC}(C, \overline{C}) \;\le\; \mathrm{NCut}(C, \overline{C}) \;\le\; 2\,\mathrm{NCC}(C, \overline{C}).$$

The analogous result holds for the ratio cut RCut(C, C̄) and the ratio Cheeger cut RCC(C, C̄). It is known that finding the global optimum of all these balanced graph cut criteria is NP-hard, see (von Luxburg, 2007). In Section 4, we will show how spectral relaxations of these criteria are related to the eigenproblem of the graph p-Laplacian.

Up to now the cuts are just defined for a partition of V into two sets. For a partition of V into k sets C_1, ..., C_k, the ratio and normalized cut can be generalized (von Luxburg, 2007) as

$$\mathrm{RCut}(C_1, \dots, C_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(C_i, \overline{C_i})}{|C_i|}, \qquad (1)$$

$$\mathrm{NCut}(C_1, \dots, C_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(C_i, \overline{C_i})}{\mathrm{vol}(C_i)}. \qquad (2)$$

There seems to exist no generally accepted multi-partition version of the Cheeger cuts. We come back to this issue in Section 5, when we discuss how to get multiple clusters using the second eigenvector of the graph p-Laplacian.
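To make the two-way criteria above concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; the helper name and the toy graph are ours) that evaluates cut, RCut, NCut, RCC and NCC for a partition given as a boolean mask, and checks the relation NCC ≤ NCut ≤ 2 NCC:

```python
import numpy as np

def cut_criteria(W, mask):
    """Evaluate cut, RCut, NCut, RCC, NCC for the partition (C, C-bar)
    given a symmetric weight matrix W and a boolean mask for C."""
    C, Cb = mask, ~mask
    cut = W[np.ix_(C, Cb)].sum()        # total weight crossing the partition
    d = W.sum(axis=1)                   # degree function d_i = sum_j w_ij
    sizes = (C.sum(), Cb.sum())         # cardinalities |C|, |C-bar|
    vols = (d[C].sum(), d[Cb].sum())    # volumes vol(C), vol(C-bar)
    return {
        "cut":  cut,
        "RCut": cut / sizes[0] + cut / sizes[1],
        "NCut": cut / vols[0] + cut / vols[1],
        "RCC":  cut / min(sizes),
        "NCC":  cut / min(vols),
    }

# Toy graph: two triangles joined by a single weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

crit = cut_criteria(W, np.array([True] * 3 + [False] * 3))
assert crit["NCC"] <= crit["NCut"] <= 2 * crit["NCC"]   # relation from Section 2
print(crit)
```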
3. The graph p-Laplacian

It is well known, see e.g. (Hein et al., 2007), that the standard graph Laplacian Δ₂ can be defined as the operator which induces the following quadratic form for a function f : V → R:

$$\langle f, \Delta_2 f \rangle = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$$

For the standard inner product one gets the unnormalized graph Laplacian Δ₂^(u), which in matrix notation is given as Δ₂^(u) = D − W, and for the weighted inner product $\langle f, g \rangle = \sum_{i=1}^{n} d_i f_i g_i$ one obtains the normalized graph Laplacian Δ₂^(n), given as Δ₂^(n) = I − D⁻¹W. (Note that our notation differs from the one in (Hein et al., 2007), where our normalized graph Laplacian is called the "random walk graph Laplacian".) One can now ask if there exists an operator Δ_p which induces the general form (for p > 1),

$$\langle f, \Delta_p f \rangle = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} |f_i - f_j|^p.$$

It turns out that this question can be answered positively, see (Amghibech, 2003). The resulting operator Δ_p is the graph p-Laplacian (which we abbreviate as p-Laplacian if no confusion is possible). Similar to the graph Laplacian we obtain, dependent on the choice of the inner product, the unnormalized and normalized p-Laplacians Δ_p^(u) and Δ_p^(n). Let i ∈ V; then

$$(\Delta_p^{(u)} f)_i = \sum_{j \in V} w_{ij}\, \phi_p(f_i - f_j), \qquad (\Delta_p^{(n)} f)_i = \frac{1}{d_i} \sum_{j \in V} w_{ij}\, \phi_p(f_i - f_j),$$

where φ_p : R → R is defined for x ∈ R as

$$\phi_p(x) = |x|^{p-1}\, \mathrm{sign}(x).$$

Note that φ₂(x) = x, so that we recover the standard graph Laplacians for p = 2. In general, the p-Laplacian is a nonlinear operator: Δ_p(αf) ≠ α Δ_p f for α ∈ R.
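As a sanity check of these definitions, the following sketch (our own, assuming a dense symmetric weight matrix) applies Δ_p^(u) to a vector, verifies the induced form ⟨f, Δ_p^(u) f⟩ = ½ Σ_ij w_ij |f_i − f_j|^p, and confirms that for p = 2 the operator reduces to (D − W)f:

```python
import numpy as np

def phi(x, p):
    """phi_p(x) = |x|^(p-1) * sign(x)."""
    return np.abs(x) ** (p - 1) * np.sign(x)

def p_laplacian(W, f, p):
    """(Delta_p^(u) f)_i = sum_j w_ij * phi_p(f_i - f_j)."""
    diff = f[:, None] - f[None, :]            # matrix of differences f_i - f_j
    return (W * phi(diff, p)).sum(axis=1)

rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
f = rng.standard_normal(5)
p = 1.5

# <f, Delta_p^(u) f> equals Q_p(f) = 1/2 * sum_ij w_ij |f_i - f_j|^p
Qp = 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()
assert np.isclose(f @ p_laplacian(W, f, p), Qp)

# for p = 2 the operator is linear and equals (D - W) f
D = np.diag(W.sum(axis=1))
assert np.allclose(p_laplacian(W, f, 2), (D - W) @ f)
```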
3.1. Eigenvalues and eigenvectors of the graph p-Laplacian

Since our goal is to use the p-Laplacian for spectral clustering, the natural question arises how one can define eigenvectors and eigenvalues for such a nonlinear operator. For notational simplicity we restrict ourselves in this section to the case of the unnormalized p-Laplacian Δ_p^(u), but all definitions and results carry over to the normalized version Δ_p^(n).

Definition 3.1 The real number λ_p is called an eigenvalue of the p-Laplacian Δ_p^(u) if there exists a function v : V → R such that

$$(\Delta_p^{(u)} v)_i = \lambda_p\, \phi_p(v_i), \qquad \forall\, i = 1, \dots, n.$$

The function v is called a p-eigenfunction of Δ_p^(u) corresponding to the eigenvalue λ_p.

The origin of this definition of an eigenvector for nonlinear operators lies in the Rayleigh-Ritz principle, a variational characterization of eigenvalues and eigenvectors for linear operators. For a symmetric matrix A ∈ R^{n×n}, it is well known that one can obtain the smallest eigenvalue λ^(1) and the corresponding eigenvector v^(1) satisfying A v^(1) = λ^(1) v^(1) via the variational characterization

$$v^{(1)} = \arg\min_{f \in \mathbb{R}^n} \frac{\langle f, A f \rangle_{\mathbb{R}^n}}{\|f\|_2^2},$$

where the p-norm is defined as $\|f\|_p^p := \sum_{i=1}^{n} |f_i|^p$. Note that this characterization implies that (up to rescaling) v^(1) is the global minimizer of ⟨f, Af⟩ subject to ‖f‖₂ = 1. This variational characterization can now be carried over to nonlinear operators. We define for the unnormalized p-Laplacian Δ_p^(u),

$$Q_p(f) := \langle f, \Delta_p^{(u)} f \rangle = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} |f_i - f_j|^p,$$

and define similarly the functional F_p : R^V → R,

$$F_p(f) := \frac{Q_p(f)}{\|f\|_p^p}.$$

Theorem 3.1 The functional F_p has a critical point at v ∈ R^V if and only if v is a p-eigenfunction of Δ_p^(u). The corresponding eigenvalue λ_p is given as λ_p = F_p(v). Moreover, we have F_p(αf) = F_p(f) for all f ∈ R^V and α ∈ R.

Proof: One can check that the condition for a critical point of F_p at v can be rewritten as

$$\Delta_p v - \frac{Q_p(v)}{\|v\|_p^p}\, \phi_p(v) = 0.$$

Thus, by Definition 3.1, v is an eigenvector of Δ_p. Moreover, the equation implies that a given eigenvector v to the eigenvalue λ_p is a critical point of F_p if λ_p = F_p(v). Summing up the eigenvector equation of Definition 3.1 shows this equality. The last statement follows directly from the definition. □

This theorem shows that in order to get all eigenvectors and eigenvalues of Δ_p^(u) we have to find all critical points of the functional F_p. Moreover, with F_p(αf) = F_p(f), we observe that the usual property of linear operators that eigenvectors are invariant under scaling carries over to the nonlinear case. The following proposition is a generalization of a result by Fiedler (1973) to the graph p-Laplacian. It relates the connectivity of the graph to properties of the first eigenvalue λ_p^(1) of the p-Laplacian. We denote by 1_A ∈ R^V the function which is one on A and zero else.

Proposition 3.1 The multiplicity of the first eigenvalue λ_p^(1) = 0 of the p-Laplacian Δ_p^(u) is equal to the number K of connected components C_1, ..., C_K of the graph. The corresponding eigenspace for λ_p^(1) = 0 is given as $\{ \sum_{i=1}^{K} \alpha_i \mathbf{1}_{C_i} \mid \alpha_i \in \mathbb{R},\ i = 1, \dots, K \}$.

Proof: We have Q_p(f) ≥ 0, so that all eigenvalues λ_p of Δ_p^(u) are non-negative. Similar to the case p = 2, one can check that $\sum_{i,j=1}^{n} w_{ij} |f_i - f_j|^p = 0$ if and only if f is constant on each connected component. □
In spectral clustering the graph is usually assumed to be connected, so that v_p^(1) = c·1 for some c ∈ R; otherwise spectral clustering is trivial. For the following we assume that the graph is connected. The previous proposition suggests that, similar to the standard case p = 2, we need at least the second eigenvector to construct a partitioning of the graph. For p = 2, we get the second eigenvector again by the variational Rayleigh-Ritz principle,

$$v^{(2)} = \arg\min_{f \in \mathbb{R}^n} \left\{ \frac{\langle f, \Delta_2^{(u)} f \rangle}{\|f\|_2^2} \;\Big|\; \langle f, \mathbf{1} \rangle = 0 \right\}.$$

This form is not suited for the p-Laplacian since its eigenvectors are not necessarily orthogonal. However, for a function with ⟨f, 1⟩ = 0 one has

$$\|f\|_2^2 = \Big\| f - \frac{1}{n} \langle f, \mathbf{1} \rangle\, \mathbf{1} \Big\|_2^2 = \min_{c \in \mathbb{R}} \|f - c\, \mathbf{1}\|_2^2.$$

Thus, we can write equivalently,

$$v^{(2)} = \arg\min_{f \in \mathbb{R}^n} \frac{\langle f, \Delta_2^{(u)} f \rangle}{\min_{c \in \mathbb{R}} \|f - c\, \mathbf{1}\|_2^2}.$$

This motivates the definition of F_p^(2) : R^V → R,

$$F_p^{(2)}(f) = \frac{Q_p(f)}{\min_{c \in \mathbb{R}} \|f - c\, \mathbf{1}\|_p^p}.$$
Theorem 3.2 The second eigenvalue λ_p^(2) of the graph p-Laplacian Δ_p^(u) is equal to the global minimum of the functional F_p^(2). The corresponding eigenvector v_p^(2) of Δ_p^(u) is then given as v_p^(2) = u* − c*·1 for any global minimizer u* of F_p^(2), where $c^* = \arg\min_{c \in \mathbb{R}} \sum_{i=1}^{n} |u_i^* - c|^p$. Furthermore, the functional F_p^(2) satisfies F_p^(2)(tu + c·1) = F_p^(2)(u) for all t, c ∈ R.

Proof: Can be found in (Bühler & Hein, 2009). □

Thus, instead of solving the complicated nonlinear equation of Definition 3.1 to obtain the second eigenvector of the graph p-Laplacian, we just have to find the global minimum of the functional F_p^(2). In the next section, we discuss the relation between the second eigenvalue λ_p^(2) of the graph p-Laplacian and the balanced graph cuts of Section 2. In Section 5, we provide an algorithmic framework to compute the second eigenvector of the p-Laplacian efficiently.
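Numerically, F_p^(2) is cheap to evaluate, since the inner minimization over c is a one-dimensional convex problem for p > 1. The sketch below is our own illustration (using scipy's bounded scalar minimizer as a stand-in for the bisection described in Section 5); it also checks the invariance F_p^(2)(tu + c·1) = F_p^(2)(u) stated in Theorem 3.2:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F2p(W, f, p):
    """F_p^(2)(f) = Q_p(f) / min_c ||f - c*1||_p^p."""
    Qp = 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()
    # inner problem: convex in c for p > 1; the minimizer lies in [min f, max f]
    inner = minimize_scalar(lambda c: (np.abs(f - c) ** p).sum(),
                            bounds=(f.min(), f.max()), method="bounded")
    return Qp / inner.fun

rng = np.random.default_rng(1)
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
u = rng.standard_normal(6)

# invariance under scaling and shifts (Theorem 3.2)
assert np.isclose(F2p(W, u, 1.5), F2p(W, 3.0 * u - 2.0, 1.5), rtol=1e-4)
```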

4. Spectral properties of the graph p-Laplacian and the Cheeger cut

Now that we have discussed the variational characterization of the second eigenvector of the p-Laplacian, we will provide the relation to the relaxation of balanced graph cut criteria, as can be done for the standard graph Laplacian.

4.1. Spectral relaxation of balanced graph cuts

It is well known that the second eigenvector of the unnormalized and normalized standard graph Laplacians (p = 2) is the solution of a relaxation of the ratio cut RCut(C, C̄) and the normalized cut NCut(C, C̄), see e.g. (von Luxburg, 2007). We will now show that the second eigenvector v_p^(2) of the p-Laplacian can also be seen as a relaxation of balanced graph cuts.

Theorem 4.1 For p > 1 and every partition of V into C, C̄ there exists a function f_{p,C} ∈ R^V such that the functional F_p^(2) associated to the unnormalized p-Laplacian satisfies

$$F_p^{(2)}(f_{p,C}) = \mathrm{cut}(C, \overline{C}) \left( \frac{1}{|C|^{\frac{1}{p-1}}} + \frac{1}{|\overline{C}|^{\frac{1}{p-1}}} \right)^{p-1},$$

with the special cases

$$F_2^{(2)}(f_{2,C}) = \mathrm{RCut}(C, \overline{C}), \qquad \lim_{p \to 1} F_p^{(2)}(f_{p,C}) = \mathrm{RCC}(C, \overline{C}).$$

Moreover, one has $F_p^{(2)}(f_{p,C}) \le 2^{p-1}\, \mathrm{RCC}(C, \overline{C})$. Equivalent statements hold for a function g_{p,C} for the normalized cut and the normalized p-Laplacian Δ_p^(n).

Proof: Let p > 1; then we define for a partition C, C̄ of V the function f_{p,C} : V → R as

$$(f_{p,C})_i = \begin{cases} 1/|C|^{\frac{1}{p-1}}, & i \in C, \\ -1/|\overline{C}|^{\frac{1}{p-1}}, & i \in \overline{C}. \end{cases}$$

One has

$$Q_p(f_{p,C}) = \sum_{i \in C,\, j \in \overline{C}} w_{ij} \left( \frac{1}{|C|^{\frac{1}{p-1}}} + \frac{1}{|\overline{C}|^{\frac{1}{p-1}}} \right)^{p}.$$

Moreover, one has

$$\min_{c \in \mathbb{R}} \|f_{p,C} - c\, \mathbf{1}\|_p^p = \|f_{p,C}\|_p^p = \frac{1}{|C|^{\frac{1}{p-1}}} + \frac{1}{|\overline{C}|^{\frac{1}{p-1}}}.$$

With $F_p^{(2)}(f_{p,C}) = Q_p(f_{p,C}) / \min_{c \in \mathbb{R}} \|f_{p,C} - c\,\mathbf{1}\|_p^p$, we get

$$F_p^{(2)}(f_{p,C}) = \sum_{i \in C,\, j \in \overline{C}} w_{ij} \left( \frac{1}{|C|^{\frac{1}{p-1}}} + \frac{1}{|\overline{C}|^{\frac{1}{p-1}}} \right)^{p-1} \;\le\; \sum_{i \in C,\, j \in \overline{C}} w_{ij} \left( \frac{2}{\min\{|C|, |\overline{C}|\}^{\frac{1}{p-1}}} \right)^{p-1} = 2^{p-1}\, \mathrm{RCC}(C, \overline{C}).$$

The first equality shows the general result and simplifies to the ratio cut for p = 2. The limit p → 1 follows with $\lim_{\alpha \to \infty} (a^{\alpha} + b^{\alpha})^{1/\alpha} = \max\{a, b\}$. □

Thus, since one minimizes over all functions in the eigenproblem for the second eigenvector of the p-Laplacians Δ_p^(u) and Δ_p^(n), it is a relaxation of the ratio/normalized cut for p = 2 and of the ratio/normalized Cheeger cut in the limit p → 1. In the interval 1 < p < 2 the eigenproblem can be seen as a relaxation of an interpolation between the ratio/normalized cut and the ratio/normalized Cheeger cut: for the functional F_p^(2) of Δ_p^(u) we get

$$F_p^{(2)}(f_{p,C}) = \mathrm{cut}(C, \overline{C}) \left( \frac{1}{|C|^{\frac{1}{p-1}}} + \frac{1}{|\overline{C}|^{\frac{1}{p-1}}} \right)^{p-1},$$

which can be understood using the inequalities between l_p-norms: for α ≥ β ≥ 1 one has ‖x‖_β ≥ ‖x‖_α, so that

$$\frac{1}{|C|} + \frac{1}{|\overline{C}|} \;\ge\; \left( \frac{1}{|C|^{\alpha}} + \frac{1}{|\overline{C}|^{\alpha}} \right)^{\frac{1}{\alpha}} \;\ge\; \max\left\{ \frac{1}{|C|},\, \frac{1}{|\overline{C}|} \right\},$$

with α = 1/(p − 1), so that for 1 < p < 2 one has ∞ > α > 1.

The spectral relaxation of the ratio cut (Hagen & Kahng, 1991) and the normalized cut (Shi & Malik, 2000) was one of the main motivations for standard spectral clustering. There exist other possibilities to relax the ratio and normalized cut problem, see (De Bie & Cristianini, 2006), which lead to a semi-definite program. These relaxations give better bounds on the true cut than the standard spectral relaxation (p = 2), though they are computationally expensive. However, to our knowledge the bounds achievable by semi-definite programming are not as tight as the ones we provide in the next section for the p-Laplacian as p → 1.

4.2. Isoperimetric inequality: the second eigenvalue λ_p^(2) and the Cheeger cut

The isoperimetric inequality (Chung, 1997) for the graph Laplacian (p = 2) provides additional theoretical backup for the spectral relaxation: it gives upper and lower bounds on the ratio/normalized Cheeger cut in terms of the second eigenvalue of the graph p-Laplacian. We define the optimal ratio and normalized Cheeger cut values h_RCC and h_NCC as

$$h_{RCC} = \inf_{C} \mathrm{RCC}(C, \overline{C}) \quad \text{and} \quad h_{NCC} = \inf_{C} \mathrm{NCC}(C, \overline{C}).$$

The standard isoperimetric inequality for p = 2 (see Chung, 1997) is given as

$$\frac{h_{NCC}^2}{2} \;\le\; \lambda_2^{(2)} \;\le\; 2\, h_{NCC},$$

where λ₂^(2) is the second eigenvalue of the standard normalized graph Laplacian (p = 2). The isoperimetric inequality for the normalized p-Laplacian has been proven by Amghibech (2003).

Theorem 4.2 (Amghibech, 2003) Denote by λ_p^(2) the second eigenvalue of the normalized p-Laplacian Δ_p^(n). Then for any p > 1,

$$2^{p-1} \left( \frac{h_{NCC}}{p} \right)^{p} \;\le\; \lambda_p^{(2)} \;\le\; 2^{p-1}\, h_{NCC}.$$

We extend the result of Amghibech to the unnormalized p-Laplacian.

Theorem 4.3 Denote by λ_p^(2) the second eigenvalue of the unnormalized p-Laplacian Δ_p^(u). For p > 1,

$$\left( \frac{2}{\max_{i} d_i} \right)^{p-1} \left( \frac{h_{RCC}}{p} \right)^{p} \;\le\; \lambda_p^{(2)} \;\le\; 2^{p-1}\, h_{RCC}.$$

Proof: Can be found in (Bühler & Hein, 2009). □
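The mechanism behind these bounds can be made concrete with the function f_{p,C} from Theorem 4.1. The following sketch (our own construction, on a random toy graph with a fixed partition) checks the closed form for F_p^(2)(f_{p,C}) and illustrates that the relaxation value approaches RCC(C, C̄) as p → 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
C = np.arange(n) < 3                      # partition C = {0,1,2}, C-bar = rest

cut = W[np.ix_(C, ~C)].sum()
for p in [2.0, 1.5, 1.1, 1.01]:
    a = 1.0 / (p - 1.0)
    # f_{p,C} from the proof of Theorem 4.1
    f = np.where(C, 1.0 / C.sum() ** a, -1.0 / (~C).sum() ** a)
    Qp = 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()
    denom = (np.abs(f) ** p).sum()        # = min_c ||f - c*1||_p^p for this f
    closed_form = cut * (1.0 / C.sum() ** a + 1.0 / (~C).sum() ** a) ** (p - 1)
    assert np.isclose(Qp / denom, closed_form)
    print(p, Qp / denom)

# as p -> 1 the value tends to RCC(C, C-bar) = cut / min(|C|, |C-bar|)
print("RCC:", cut / min(C.sum(), (~C).sum()))
```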
Note that h_NCC < 1 and h_RCC / max_i d_i < 1, so that in both cases the left-hand side of the bound is smaller than h_NCC resp. h_RCC. When considering the limit p → 1, one observes that the bounds on λ_p^(2) become tight. Thus, in the limit p → 1, the second eigenvalue of the unnormalized/normalized p-Laplacian approximates the optimal ratio/normalized Cheeger cut arbitrarily well.

Still the problem remains how to transform the real-valued second eigenvector of the p-Laplacian into a partitioning of the graph. We use the standard procedure and threshold the second eigenvector v_p^(2) to obtain the partitioning, where the optimal threshold is determined by minimizing the corresponding Cheeger cut. For the second eigenvector v_p^(2) of the unnormalized graph p-Laplacian Δ_p^(u) we determine

$$\arg\min_{C_t = \{ i \in V \,\mid\, v_p^{(2)}(i) > t \}} \mathrm{RCC}(C_t, \overline{C_t}), \qquad (3)$$

and similarly for the second eigenvector v_p^(2) of the normalized graph p-Laplacian Δ_p^(n) we compute

$$\arg\min_{C_t = \{ i \in V \,\mid\, v_p^{(2)}(i) > t \}} \mathrm{NCC}(C_t, \overline{C_t}). \qquad (4)$$
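Since only the ordering of the entries of v_p^(2) matters, the sweep in (3) reduces to the n − 1 splits between consecutive sorted entries. A minimal sketch of this step (our own helper, for a dense symmetric weight matrix; an incremental cut update would make it cheaper for large n):

```python
import numpy as np

def best_rcc_threshold(W, v):
    """Optimal Cheeger-cut thresholding (3): try all n-1 splits of the
    sorted second eigenvector v and keep the one minimizing RCC."""
    order = np.argsort(v)
    best = (np.inf, None)
    for k in range(1, len(v)):
        C = np.zeros(len(v), dtype=bool)
        C[order[:k]] = True                 # the k smallest entries of v
        cut = W[np.ix_(C, ~C)].sum()
        rcc = cut / min(k, len(v) - k)      # RCC(C_t, C_t-bar)
        if rcc < best[0]:
            best = (rcc, C)
    return best

rng = np.random.default_rng(3)
W = rng.random((10, 10)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
v = rng.standard_normal(10)                 # stand-in for v_p^(2)
rcc, C = best_rcc_threshold(W, v)
print(rcc, np.flatnonzero(C))
```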
The obvious question is how good the cut values obtained by thresholding the second eigenvector of the p-Laplacian are, compared to the optimal Cheeger cut values. The following theorem answers this question and provides the key motivation for p-spectral clustering.

Theorem 4.4 Denote by h*_RCC and h*_NCC the ratio/normalized Cheeger cut values obtained by thresholding the second eigenvector v_p^(2) of the unnormalized/normalized p-Laplacian via (3) for Δ_p^(u) resp. (4) for Δ_p^(n). Then for p > 1,

$$h_{RCC} \;\le\; h^{*}_{RCC} \;\le\; p \left( \max_{i \in V} d_i \right)^{\frac{p-1}{p}} h_{RCC}^{\frac{1}{p}},$$

$$h_{NCC} \;\le\; h^{*}_{NCC} \;\le\; p\, h_{NCC}^{\frac{1}{p}}.$$

Proof: Can be found in (Bühler & Hein, 2009). □

One observes that in the limit of p → 1 both inequalities become tight, which implies that for p → 1 the cut found by thresholding the second eigenvector of the p-Laplacian converges to the optimal Cheeger cut.
5. p-Spectral Clustering

The algorithmic scheme for p-spectral clustering is shown in Algorithm 1. More than two clusters are obtained by consecutive splitting of clusters until the desired number of clusters is reached. As multi-partition criterion, we use the established generalized versions of the ratio cut (1) and the normalized cut (2). However, one could also think about multi-partition versions of the Cheeger cut. The sequential splitting of clusters is the more "traditional" way to do spectral clustering. Alternatively, for the standard graph Laplacian one uses the first k eigenvectors to define a new representation of the data; in this new k-dimensional representation one then applies a standard clustering algorithm like k-means. This alternative is not possible in our case, since at the moment we are not able to compute higher-order eigenvectors of the p-Laplacian. However, as Theorem 4.4 shows, there is also no need to go this way, since thresholding will yield the optimal Cheeger cut in the limit p → 1.

Algorithm 1 p-Laplacian based Spectral Clustering
1: Input: weight matrix W, number of desired clusters k, choice of p-Laplacian.
2: Initialization: cluster C_1 = V, number of clusters s = 1.
3: repeat
4:   Minimize F_p^(2) : R^{C_i} → R for the chosen p-Laplacian for each cluster C_i, i = 1, ..., s.
5:   Compute the optimal threshold for dividing each cluster C_i via (3) for Δ_p^(u) or (4) for Δ_p^(n).
6:   Choose to split the cluster C_i so that the total multi-partition cut criterion is minimized (ratio cut (1) for Δ_p^(u) and normalized cut (2) for Δ_p^(n)).
7:   s ⇐ s + 1
8: until number of clusters s = k

The functional F_p^(2) : R^V → R is non-convex and thus we cannot guarantee to reach the global minimum. Indeed, a direct minimization for small values of p often converges very fast to a non-optimal local minimum. Thus we use a different procedure, exploiting the fact that for p = 2 we can easily compute the global minimizer of F_2^(2): it is just the second eigenvector of the standard graph Laplacian, which can be computed efficiently for sparse matrices, e.g. using ARPACK. Since the functional F_p^(2)(f) is continuous in p, we can hope, for close values p₁ and p₂, that the global minimizers of F_{p₁}^(2) and F_{p₂}^(2) are also close (at least the local minimizers should be close). Moreover, it is well known that Newton-like methods have superlinear convergence close to the local optima (Bertsekas, 1999). These two facts suggest solving the problem of minimizing F_p^(2)(u) via a sequence of functionals

$$F_{p_0}^{(2)},\, F_{p_1}^{(2)},\, \dots,\, F_{p}^{(2)}, \qquad \text{with } p_0 = 2 > p_1 > \dots > p,$$

where each step is initialized with the solution of the previous step and the initialization is done with p₀ = 2. In the experiments we found that the update rule p_{t+1} = 0.9 p_t yields a good trade-off between decreasing too fast, with the danger that the optimum for F_{p_t}^(2) is far away from the optimum of F_{p_{t+1}}^(2), and decreasing too slowly, which yields fast convergence of the Newton method but needs a lot of iterations.
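A compact sketch of this continuation idea follows. It is our own simplification: a dense eigensolver replaces ARPACK for the p = 2 initialization, and quasi-Newton BFGS with numerical gradients stands in for the paper's Newton scheme with the Hessian surrogate described in the next paragraph; for large sparse graphs one would use scipy.sparse.linalg.eigsh and the scheme below.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def F2p(W, f, p):
    """F_p^(2)(f) = Q_p(f) / min_c ||f - c*1||_p^p."""
    Qp = 0.5 * (W * np.abs(f[:, None] - f[None, :]) ** p).sum()
    inner = minimize_scalar(lambda c: (np.abs(f - c) ** p).sum(),
                            bounds=(f.min(), f.max()), method="bounded")
    return Qp / inner.fun

def second_eigvec_p(W, p_target):
    """Continuation scheme of Section 5: initialize at p = 2 with the
    standard Laplacian eigenvector, then shrink p via p <- 0.9 p."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)          # dense solver; ARPACK for sparse W
    f = vecs[:, 1]                       # global minimizer of F_2^(2)
    p = 2.0
    while p > p_target:
        p = max(0.9 * p, p_target)       # update rule p_{t+1} = 0.9 p_t
        f = minimize(lambda g: F2p(W, g, p), f, method="BFGS").x
        f = (f - f.mean()) / np.linalg.norm(f)   # F_p^(2) is shift/scale invariant
    return f

rng = np.random.default_rng(4)
W = rng.random((12, 12)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
v12 = second_eigvec_p(W, p_target=1.2)
```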
The minimization of the functionals F_{p_t}^(2) is done using a mixture of gradient and Newton steps. However, the Hessian of F_{p_t}^(2) is not sparse, which causes problems for large scale data. It can, however, be decomposed into

$$H = A + a b^{T} + b a^{T} + b b^{T},$$

where a, b ∈ R^n and the matrix A is sparse; thus H is a sparse matrix plus low-rank updates. We simply discard the low-rank updates and use A as a surrogate for the true Hessian. We use the Minimal Residual method (Paige & Saunders, 1975) for solving the linear system of the Newton step, as the matrix A is symmetric but not necessarily positive definite. In order to avoid problems with an ill-conditioned matrix A, we add a small ridge. Note that the term min_{c∈R} ‖f − c·1‖_p^p in the functional F_p^(2)(f) is itself a (convex) optimization problem, which can be solved very fast using bisection.
tion one then applies a standard clustering algorithm solved very fast using bisection.
like k-means. This alternative is not possible in our
case since at the moment we are not able to compute 6. Experimental evaluation
higher-order eigenvectors of the p-Laplacian. However,
as Theorem 4.4 shows there is also need for going this In all experiments, we used a symmetric K-NN graph
way since thresholding will yield the optimal Cheeger with K = 10 and weights wij defined as
cut in the limit p → 1. − 4
kxi −xj k2
σ2
(2)
wij = max{si (j), sj (i)}, where si (j) = e i ,
The functional Fp : RV → R is non-convex and thus
we cannot guarantee to reach the global minimum. In- with σi being the Euclidean distance of xi to its
deed, a direct minimization for small values of p leads K-nearest neighbor. We evaluate the clustering on
We evaluate the clustering on datasets with known number of classes k. We then clustered the data into k clusters and checked the agreement of the found clusters C_1, ..., C_k with the class structure using the error measure

$$\mathrm{error}(C_1, \dots, C_k) = \frac{1}{|V|} \sum_{i=1}^{k} \sum_{j \in C_i} \mathbb{I}_{Y_j \ne Y_i'}, \qquad (5)$$

where Y_j is the true label of j and Y_i' is the dominant label in cluster C_i.
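A minimal sketch of this error measure (our own helper; labels are assumed to be non-negative integers):

```python
import numpy as np

def clustering_error(y_true, y_cluster):
    """error = (1/|V|) * number of points not carrying the dominant
    label of their cluster, cf. (5)."""
    mistakes = 0
    for c in np.unique(y_cluster):
        labels = y_true[y_cluster == c]
        dominant = np.bincount(labels).argmax()   # Y'_i: dominant label in cluster
        mistakes += (labels != dominant).sum()
    return mistakes / len(y_true)

y_true = np.array([0, 0, 0, 1, 1, 1])
y_cluster = np.array([0, 0, 1, 1, 1, 1])
print(clustering_error(y_true, y_cluster))        # 1/6
```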
6.1. High-dimensional noisy two moons

The two moons dataset is generated as two half-circles in R² which are embedded into a d-dimensional space where Gaussian noise N(0, σ²I_d) is added. When varying d, n and σ, we always made the same observation: unnormalized and normalized p-spectral clustering leads, for decreasing values of p, to cuts with decreasing values of the Cheeger cuts RCC and NCC. In Fig. 1, we illustrate this for the case d = 100, n = 2000 and σ² = 0.02. Note that this dataset is far from being trivial, since the high-dimensional noise has corrupted the graph (see the edge structure in Fig. 1). The histograms of the values of the second eigenvectors for p equal to 2, 1.7, 1.4 and 1.1 show strong differences: for p = 2, the values are scattered over the interval, whereas for p = 1.1 they are almost concentrated on two peaks. This suggests that for p = 1.1 the p-eigenvector is quite close to the function f_{p,C} as defined in Theorem 4.1. The third row in Fig. 1 shows the resulting clusters found by p-spectral clustering with Δ_p^(n). For p → 1, the clustering is almost perfect despite the difficulty of this dataset. In order to illustrate that this result is representative, we have repeated the experiment 100 times. The plot in the bottom left of Fig. 1 shows the mean of the normalized Cheeger cut, the second eigenvalue λ_p^(2), the normalized cut and the error as p → 1. One observes that, despite some variance, the results of p-spectral clustering are significantly better than standard spectral clustering.

[Figure 1. Results for the two moons data set, 2000 points in 100 dimensions, noise variance 0.02. First row, from left to right: second eigenvector of the p-Laplacian for p = 2.0, 1.7, 1.4, 1.1. Second row: histogram of the values of the second eigenvector. Last row: resulting clustering after finding the optimal threshold according to the NCC criterion; the cut values per column are NCut 0.1461 / NCC 0.07374, NCut 0.1408 / NCC 0.07050, NCut 0.1155 / NCC 0.05791, NCut 0.1035 / NCC 0.05330. First column, top: the values of NCC, the eigenvalue λ_p^(2), NCut and the error for the example shown on the right. Middle: plot of the edge structure. Bottom: average values plus standard deviation of NCC, NCut, λ_p^(2) and the error for varying p.]

6.2. UCI-Datasets

In Table 2 we show results for p-spectral clustering on several UCI datasets, both for the unnormalized (right columns) and the normalized p-Laplacian (left columns). The corresponding Cheeger cuts (second row of each dataset) are consistently decreasing as p → 1. For most of the datasets this also implies that the ratio/normalized cut decreases. Note that the error is often constant despite the fact that the cut is still decreasing: in contrast to the other examples, minimizing the cut does not necessarily lead to a smaller error.

Table 2. Results of unnormalized/normalized p-spectral clustering on UCI-datasets. For each dataset, the three rows correspond to NCut, NCC and error (normalized case) resp. RCut, RCC and error (unnormalized case).

                  Normalized                 Unnormalized
  p           2.0     1.4     1.1        2.0     1.4     1.1
  Breast    0.0254  0.0229  0.0289     0.0467  0.0332  0.0332
            0.0209  0.0135  0.0174     0.0300  0.0220  0.0220
            0.293   0.293   0.293      0.293   0.293   0.293
  Heart     0.118   0.0796  0.0796     0.108   0.0946  0.0946
            0.0621  0.0579  0.0579     0.0550  0.0473  0.0473
            0.215   0.356   0.356      0.204   0.219   0.219
  Ringnorm  0.443   0.420   0.420      0.219   0.210   0.210
            0.222   0.210   0.210      0.109   0.105   0.105
            0.281   0.288   0.287      0.290   0.310   0.309
  Twonorm   0.0821  0.0813  0.0811     0.0392  0.0388  0.0387
            0.0411  0.0407  0.0406     0.0196  0.0194  0.0193
            0.0257  0.0259  0.0261     0.0255  0.0261  0.0269
  Waveform  0.101   0.0857  0.0828     0.0485  0.0410  0.0396
            0.0552  0.0460  0.0438     0.0265  0.0221  0.0210
            0.227   0.211   0.201      0.225   0.212   0.201

6.3. USPS and MNIST

We perform unnormalized p-spectral clustering on the full USPS and MNIST datasets (n = 9298 and n = 70000). In Table 1 one observes that for p → 1 the ratio cut as well as the error decrease for both datasets. The error is even misleading, since the class separation is quite good but one class has been split, which implies that two classes have been merged. This happens for both datasets, and in Table 1 we provide the confusion matrix for MNIST for p-spectral clustering with p = 1.2. For larger numbers of clusters k we thus expect better results.

Table 1. Top: Results of unnormalized p-spectral clustering with k = 10 for USPS and MNIST using the ratio cut multi-partition criterion (1). In both cases the RCut and the error significantly decrease as p decreases. Bottom: Confusion matrix for MNIST of the clusters found by p-spectral clustering for p = 1.2. Class 1 has been split into two clusters and classes 4 and 9 have been merged; thus there exists no class 9 in the table. Apart from the merged classes, the clustering reflects the class structure quite well.

            USPS              MNIST
  p     RCut   Error      RCut   Error
 2.0    0.819  0.233      0.225  0.189
 1.9    0.741  0.142      0.209  0.172
 1.8    0.718  0.141      0.186  0.170
 1.7    0.698  0.139      0.170  0.169
 1.6    0.684  0.134      0.164  0.170
 1.5    0.676  0.133      0.161  0.133
 1.4    0.693  0.141      0.158  0.132
 1.3    0.684  0.138      0.155  0.131
 1.2    0.679  0.137      0.153  0.129

 True/Cluster     0     1     2     3     4     5     6     7     8
  0            6845     5     7     0     5     8    26     4     3
  1               1  7794    32     8    21     1     2    16     2
  2              38    47  6712    25    15     5     8   114    26
  3               5     6    31  6939    30    61     2    45    22
  4               3    45     2     1  6750     0    14     5     4
  5              15     1     4    92    39  6087    61     5     9
  6              23    17     6     0     9    23  6797     0     1
  7               1    83    22     1   116     2     0  7067     1
  8              18    51    13   507   112   122    23    18  5961
  9              15    15     3   117  6708    11     4    77     8

In the following table we present the runtime behavior (in seconds) for USPS:

  p    2.0   1.9   1.8   1.7   1.6   1.5    1.4    1.3    1.2
  t     10    81    99   144   224   456   1147   2266   4660

As p → 1, the problem becomes more difficult, which is to be expected since one asymptotically approximates the optimal Cheeger cut. However, there is still room for improvement to speed up our current implementation.

Acknowledgments

This work has been supported by the Excellence Cluster on Multimodal Computing and Interaction at Saarland University.

References

Amghibech, S. (2003). Eigenvalues of the discrete p-Laplacian for graphs. Ars Combin., 67, 283–302.

Bertsekas, D. (1999). Nonlinear programming. Athena Scientific.

Bühler, T., & Hein, M. (2009). Supplementary material. http://www.ml.uni-saarland.de/Publications/BueHei09tech.pdf.

Chung, F. (1997). Spectral graph theory. AMS.

De Bie, T., & Cristianini, N. (2006). Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. J. Mach. Learn. Res., 7, 1409–1436.

Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J., 23, 298–305.

Hagen, L., & Kahng, A. B. (1991). Fast spectral methods for ratio cut partitioning and clustering. Proc. IEEE Intl. Conf. on Computer-Aided Design, 10–13.

Hein, M., Audibert, J.-Y., & von Luxburg, U. (2007). Graph Laplacians and their convergence on random neighborhood graphs. J. Mach. Learn. Res., 8, 1325–1368.

Paige, C., & Saunders, M. (1975). Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12, 617–629.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 22, 888–905.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17, 395–416.

Zhou, D., & Schölkopf, B. (2005). Regularization on discrete spaces. Deutsche Arbeitsgemeinschaft für Mustererkennung-Symposium (pp. 361–368).
