
Pattern Recognition 41 (2008) 176 – 190

www.elsevier.com/locate/pr

A survey of kernel and spectral methods for clustering


Maurizio Filippone a,∗ , Francesco Camastra b , Francesco Masulli a , Stefano Rovetta a
a Department of Computer and Information Science, University of Genova, and CNISM, Via Dodecaneso 35, I 16146 Genova, Italy
b Department of Applied Science, University of Naples Parthenope, Via A. De Gasperi 5, I 80133 Napoli, Italy

Received 19 October 2006; received in revised form 30 April 2007; accepted 29 May 2007

Abstract
Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the
partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present
a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The
presented kernel clustering methods are the kernel versions of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral
clustering arises from concepts in spectral graph theory, and the clustering problem is configured as a graph cut problem where an appropriate
objective function has to be optimized. Since it has been proven that these two seemingly different approaches share the same mathematical
foundation, an explicit proof of the fact that they have the same objective is reported. In addition, fuzzy kernel clustering methods are presented
as extensions of the kernel K-means clustering algorithm.
© 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Partitional clustering; Mercer kernels; Kernel clustering; Kernel fuzzy clustering; Spectral clustering

1. Introduction

Unsupervised data analysis using clustering algorithms provides a useful tool to explore data structures. Clustering methods [1,2] have been addressed in many contexts and disciplines such as data mining, document retrieval, image segmentation and pattern classification. The aim of clustering methods is to group patterns on the basis of a similarity (or dissimilarity) criterion, where groups (or clusters) are sets of similar patterns. Crucial aspects in clustering are pattern representation and the similarity measure. Each pattern is usually represented by a set of features of the system under study. It is very important to notice that a good choice of representation of patterns can lead to improvements in clustering performance; whether it is possible to choose an appropriate set of features depends on the system under study. Once a representation is fixed, it is possible to choose an appropriate similarity measure among patterns. The most popular dissimilarity measure for metric representations is the distance, for instance the Euclidean one [3].

Clustering techniques can be roughly divided into two categories:

• hierarchical;
• partitioning.

Hierarchical clustering techniques [1,4,5] are able to find structures which can be further divided in substructures and so on recursively. The result is a hierarchical structure of groups known as a dendrogram.

Partitioning clustering methods try to obtain a single partition of the data without any further sub-partition, unlike hierarchical algorithms, and are often based on the optimization of an appropriate objective function. The result is the creation of separating hypersurfaces among clusters. For instance, we can consider two nonlinear clusters as in Fig. 1. Standard partitioning methods (e.g., K-means, fuzzy c-means, SOM and neural gas) using two centroids are not able to separate the two rings in the desired way. The use of many centroids could solve this problem, at the price of a complex description of a simple data set.

∗ Corresponding author. Tel.: +39 010 3536610; fax: +39 010 3536699. E-mail addresses: filippone@disi.unige.it (M. Filippone), francesco.camastra@uniparthenope.it (F. Camastra), masulli@disi.unige.it (F. Masulli), rovetta@disi.unige.it (S. Rovetta).
Fig. 1. A data set composed of two rings of points.

For this reason several modifications and new approaches have been introduced to cope with this problem. Among the large number of modifications we can mention the fuzzy c-varieties [6], whose main drawback is that some a priori information on the shape of the clusters must be included.

Recently, some clustering methods that produce nonlinear separating hypersurfaces among clusters have been proposed. These algorithms can be divided into two big families: kernel and spectral clustering methods.

Regarding kernel clustering methods, several classical algorithms have been modified to incorporate kernels (e.g., K-means, fuzzy c-means, SOM and neural gas). The use of kernels allows one to map data implicitly into a high dimensional space called feature space; computing a linear partitioning in this feature space results in a nonlinear partitioning in the input space.

Spectral clustering methods arise from concepts in spectral graph theory. The basic idea is to construct a weighted graph from the initial data set, where each node represents a pattern and each weighted edge simply takes into account the similarity between two patterns. In this framework the clustering problem can be seen as a graph cut problem, which can be tackled by means of spectral graph theory. The core of this theory is the eigenvalue decomposition of the Laplacian matrix of the weighted graph obtained from the data. In fact, there is a close relationship between the second smallest eigenvalue of the Laplacian and the graph cut [7,8].

The aim of this paper is to present a survey of kernel and spectral clustering methods. Moreover, an explicit proof of the fact that these two approaches have the same mathematical foundation is reported. In particular, it has been shown by Dhillon et al. that kernel K-means and spectral clustering with the ratio association as the objective function are perfectly equivalent [9–11]. The core of both approaches lies in their ability to construct an adjacency structure between data, avoiding a prefixed shape of clusters. These approaches have a slight similarity with hierarchical methods in the use of an adjacency structure, with the main difference lying in the philosophy of the grouping procedure.

A comparison of some spectral clustering methods has been recently proposed in Ref. [12], while there are some theoretical results on the capabilities and convergence properties of spectral methods for clustering [13–16]. Recently, kernel methods have also been applied to fuzzy c-varieties [17] with the aim of finding varieties in feature space, and there are some interesting clustering methods using kernels such as Refs. [18,19]. Since the choice of the kernel and of the similarity measure is crucial in these methods, many techniques have been proposed in order to learn the shape of kernels automatically from data, as in Refs. [20–23].

Regarding the applications, most of these algorithms (e.g., Refs. [17,21,24]) have been applied to standard benchmarks such as Ionosphere [25], Breast Cancer [26] and Iris [27] (these data sets can be found at ftp://ftp.ics.uci.edu/pub/machine-learning-databases/). Kernel fuzzy c-means proposed in Refs. [28–30] has been applied to image segmentation problems, while in Ref. [31] it has been applied to handwritten digit recognition. There are applications of kernel clustering methods in face recognition using kernel SOM [32], in speech recognition [33] and in prediction of crop yield from climate and plantation data [34]. Spectral methods have been applied to the clustering of artificial data [35,36], to image segmentation [23,37,38], to bioinformatics [39], and to co-clustering problems of words and documents [40] and genes and conditions [41]. A semi-supervised spectral approach to bioinformatics and handwritten character recognition has been proposed in Ref. [42]. The protein sequence clustering problem has been faced using spectral techniques in Ref. [43] and kernel methods in Ref. [44].

In the next section we briefly introduce the concepts of linear partitioning methods by recalling some basic crisp and fuzzy algorithms. The paper is then organized as follows: Section 3 shows the kernelized versions of the algorithms presented in Section 2, Section 4 discusses spectral clustering, while Section 5 reports the equivalence between spectral and kernel clustering methods. In the last section conclusions are drawn.

2. Partitioning methods

In this section we briefly recall some basic facts about partitioning clustering methods and report the clustering methods for which a kernel version has been proposed. Let X = \{x_1, \ldots, x_n\} be a data set composed of n patterns with x_i \in \mathbb{R}^d. The codebook (or set of centroids) V is defined as the set V = \{v_1, \ldots, v_c\}, typically with c \ll n. Each element v_i \in \mathbb{R}^d is called a codevector (or centroid or prototype); among the many terms used to denote such objects, we will use codevector, as in vector quantization theory.

The Voronoi region R_i of the codevector v_i is the set of vectors in \mathbb{R}^d for which v_i is the nearest codevector:

R_i = \{ z \in \mathbb{R}^d \mid i = \arg\min_j \| z - v_j \|^2 \}.   (1)

It is possible to prove that each Voronoi region is convex [45] and that the boundaries of the regions are linear segments.

The definition of the Voronoi set \pi_i of the codevector v_i is straightforward: it is the subset of X for which the codevector v_i is the nearest one,

\pi_i = \{ x \in X \mid i = \arg\min_j \| x - v_j \|^2 \},   (2)

that is, the set of vectors belonging to R_i. The partition of \mathbb{R}^d induced by all Voronoi regions is called a Voronoi tessellation or Dirichlet tessellation (Fig. 2).

Fig. 2. An example of Voronoi tessellation where each black point is a codevector.
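As a concrete illustration of Eqs. (1), (2) and (4) (not part of the original paper), the following minimal NumPy sketch computes the Voronoi sets of a codebook and the corresponding centroid update; the function names and the toy data are our own choices.

```python
import numpy as np

def voronoi_sets(X, V):
    """Assign every pattern in X (n x d) to its nearest codevector in V (c x d),
    i.e. compute the Voronoi set membership of Eq. (2)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def centroid_update(X, V, labels):
    """Move each codevector to the mean of its Voronoi set (centroid condition,
    Eq. (4)); empty Voronoi sets leave the codevector unchanged."""
    return np.vstack([X[labels == i].mean(axis=0) if np.any(labels == i) else V[i]
                      for i in range(len(V))])

# Toy usage: 100 random two-dimensional patterns and 3 codevectors picked from X.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
V = X[rng.choice(len(X), size=3, replace=False)]
labels = voronoi_sets(X, V)
V = centroid_update(X, V, labels)
```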
2.1. Batch K-means

A simple algorithm able to construct a Voronoi tessellation of the input space was proposed in 1957 by Lloyd [46] and is known as batch K-means. Starting from the finite data set X, this algorithm iteratively moves the k codevectors to the arithmetic mean of their Voronoi sets \{\pi_i\}_{i=1,\ldots,k}. Theoretically speaking, a necessary condition for a codebook V to minimize the empirical quantization error

E(X) = \frac{1}{2n} \sum_{i=1}^{k} \sum_{x \in \pi_i} \| x - v_i \|^2   (3)

is that each codevector v_i fulfills the centroid condition [47]. In the case of a finite data set X and the Euclidean distance, the centroid condition reduces to

v_i = \frac{1}{|\pi_i|} \sum_{x \in \pi_i} x.   (4)

Batch K-means is formed by the following steps:

(1) choose the number k of clusters;
(2) initialize the codebook V with vectors randomly picked from X;
(3) compute the Voronoi set \pi_i associated with each codevector v_i;
(4) move each codevector to the mean of its Voronoi set using Eq. (4);
(5) return to step 3 if any codevector has changed, otherwise return the codebook.

At the end of the algorithm a codebook is found and a Voronoi tessellation of the input space is provided. It is guaranteed that after each iteration the quantization error does not increase. Batch K-means can be viewed as an expectation–maximization [48] algorithm, ensuring convergence after a finite number of steps.

This approach presents many disadvantages [3]. Local minima of E(X) make the method dependent on initialization, and the average is sensitive to outliers. Moreover, the number of clusters to find must be provided, and this can be done only using some a priori information or an additional validity criterion. Finally, K-means can deal only with clusters with spherically symmetric point distributions, since Euclidean distances of patterns from centroids are computed, leading to a spherical invariance. Different distances lead to different invariance properties, as in the case of the Mahalanobis distance, which produces invariance on ellipsoids [3].

The term batch means that at each step the algorithm takes into account the whole data set to update the codevectors. When the cardinality n of the data set X is very high (e.g., several hundreds of thousands) the batch procedure is computationally expensive. For this reason an on-line update has been introduced, leading to the on-line K-means algorithm [45,49]. At each step, this method simply picks an input pattern at random and updates its nearest codevector, provided that the scheduling of the updating coefficient is adequate to allow convergence and consistency.
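The batch K-means loop of steps (1)–(5) can be sketched in a few lines of NumPy; this is only an illustrative implementation under our own naming choices, with a simple tolerance-based stopping test in place of the "no codevector changed" check.

```python
import numpy as np

def batch_kmeans(X, k, max_iter=100, tol=1e-9, seed=0):
    """Batch K-means (Lloyd's algorithm): alternate the Voronoi-set computation
    and the centroid condition of Eq. (4) until the codebook stops changing."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=k, replace=False)]          # step 2: random codebook
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                             # step 3: Voronoi sets
        V_new = np.vstack([X[labels == i].mean(axis=0) if np.any(labels == i)
                           else V[i] for i in range(k)])       # step 4: move to means
        if np.max(np.abs(V_new - V)) < tol:                    # step 5: stop if unchanged
            V = V_new
            break
        V = V_new
    return V, labels

# The quantization error of Eq. (3) for the returned codebook is
# 0.5 / len(X) times the sum of squared distances of patterns to their codevectors.
```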
2.2. Self-organizing maps (SOM)

A self-organizing map (SOM) [50], also known as self-organizing feature map (SOFM), represents data by means of codevectors organized on a grid with fixed topology. Codevectors move to adapt to the input distribution, but the adaptation is also propagated along the grid to neighboring codevectors, according to a given propagation or neighborhood function. This effectively constrains the evolution of the codevectors. Grid topologies may differ, but in this paper we consider a two-dimensional, square-mesh topology [51,52]. The distance on the grid is used to determine how strongly a codevector is adapted when the unit a_{ij} is the winner. The metric used on a rectangular grid is the Manhattan distance, for which the distance between two elements r = (r_1, r_2) and s = (s_1, s_2) is

d_{rs} = |r_1 - s_1| + |r_2 - s_2|.   (5)

The SOM algorithm is the following:

(1) Initialize the codebook V by randomly picking from X.
(2) Initialize the set C of connections to form the rectangular grid of dimension n_1 \times n_2.
(3) Initialize t = 0.
(4) Randomly pick an input x from X.
(5) Determine the winner

s(x) = \arg\min_{v_j \in V} \| x - v_j \|.   (6)

(6) Adapt each codevector:

\Delta v_j = \epsilon(t)\, h(d_{rs})\, (x - v_j),   (7)

where h is a decreasing function of d, for instance

h(d_{rs}) = \exp\left( -\frac{d_{rs}^2}{2\sigma^2(t)} \right).   (8)

(7) Increment t.
(8) If t < t_{max} go to step 4.

\epsilon(t) and \sigma(t) are decreasing functions of t, for example [53]:

\epsilon(t) = \epsilon_i \left( \frac{\epsilon_f}{\epsilon_i} \right)^{t/t_{max}}, \qquad \sigma(t) = \sigma_i \left( \frac{\sigma_f}{\sigma_i} \right)^{t/t_{max}},   (9)

where \epsilon_i, \epsilon_f and \sigma_i, \sigma_f are the initial and final values of the functions \epsilon(t) and \sigma(t).

A final note on the use of SOM for clustering. The method was originally devised as a tool for embedding multidimensional data into typically two-dimensional spaces for data visualization. Since then, it has also been frequently used as a clustering method, which was originally not considered appropriate because of the constraints imposed by the topology. However, the topology itself serves an important purpose, namely that of limiting the flexibility of the mapping in the first training cycles and gradually increasing it (while decreasing the magnitude of the updates to ensure convergence) as more cycles are performed. The strategy is similar to that of other algorithms, including those described in the following, in the capacity control of the method, which has the effect of avoiding local minima. This accounts for the fast convergence often reported in experimental works.
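The following NumPy sketch illustrates the on-line SOM updates of Eqs. (5)–(9); the schedule endpoints and the grid size are illustrative choices of ours, not values prescribed by the paper.

```python
import numpy as np

def som(X, n1, n2, t_max=5000, eps=(0.5, 0.05), sigma=(2.0, 0.3), seed=0):
    """Self-organizing map on an n1 x n2 rectangular grid (Section 2.2).

    eps and sigma hold the initial/final values of the exponentially decreasing
    schedules of Eq. (9); the neighborhood function is Eq. (8) and grid
    distances are Manhattan distances, Eq. (5)."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), size=n1 * n2, replace=False)].astype(float).copy()
    grid = np.array([(r, c) for r in range(n1) for c in range(n2)])  # unit coordinates
    for t in range(t_max):
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac
        sigma_t = sigma[0] * (sigma[1] / sigma[0]) ** frac
        x = X[rng.integers(len(X))]                        # step 4: pick a pattern
        s = np.argmin(((x - V) ** 2).sum(axis=1))          # step 5: winner, Eq. (6)
        d = np.abs(grid - grid[s]).sum(axis=1)             # Manhattan grid distance
        h = np.exp(-d ** 2 / (2.0 * sigma_t ** 2))         # neighborhood, Eq. (8)
        V += eps_t * h[:, None] * (x - V)                  # adaptation, Eq. (7)
    return V, grid
```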
2.3. Neural gas

Another technique that tries to minimize the distortion error is the neural gas algorithm [54], based on a soft adaptation rule. This technique resembles the SOM in the sense that not only the winner codevector is adapted. It differs in that codevectors are not constrained to lie on a grid, and the adaptation of the codevectors near the winner is controlled by a criterion based on distance ranks. Each time a pattern x is presented, all the codevectors v_j are ranked according to their distance to x (the closest obtains the lowest rank). Denoting by \rho_j the rank of the distance between x and the codevector v_j, the update rule is

\Delta v_j = \epsilon(t)\, h_\lambda(\rho_j)\, (x - v_j),   (10)

with \epsilon(t) \in [0, 1] gradually lowered as t increases and h_\lambda(\rho_j) a function decreasing with \rho_j with a characteristic decay \lambda; usually h_\lambda(\rho_j) = \exp(-\rho_j / \lambda). The neural gas algorithm is the following:

(1) Initialize the codebook V by randomly picking from X.
(2) Initialize the time parameter t = 0.
(3) Randomly pick an input x from X.
(4) Order all elements v_j of V according to their distance to x, obtaining the ranks \rho_j.
(5) Adapt the codevectors according to Eq. (10).
(6) Increase the time parameter t = t + 1.
(7) If t < t_{max} go to step 3.

2.4. Fuzzy clustering methods

Bezdek [6] introduced the concept of hard and fuzzy partitions in order to extend the notion of membership of patterns to clusters. The motivation of this extension is related to the fact that a pattern often cannot be thought of as belonging to a single cluster only. In many cases, a description in which the membership of a pattern is shared among clusters is necessary.

Definition 2.1. Let A_{cn} denote the vector space of c \times n real matrices over \mathbb{R}. Considering X, A_{cn} and c \in \mathbb{N} such that 2 \le c < n, the fuzzy c-partition space for X is the set

M_{fc} = \left\{ U \in A_{cn} \;\middle|\; u_{ih} \in [0, 1] \ \forall i, h;\ \sum_{i=1}^{c} u_{ih} = 1 \ \forall h;\ 0 < \sum_{h=1}^{n} u_{ih} < n \ \forall i \right\}.   (11)

The matrix U is the so-called membership matrix, since each element u_{ih} is the fuzzy membership of the hth pattern to the ith cluster. The definition of M_{fc} simply states that the sum of the memberships of a pattern to all clusters is one (probabilistic constraint) and that a cluster can neither be empty nor contain all patterns. This definition generalizes the notion of hard c-partitions in Ref. [6].

The mathematical tool used in all these methods for working out the solution procedure is the Lagrange multipliers technique. In particular, a minimization of the intracluster distance functional with a probabilistic constraint on the memberships of a pattern to all clusters has to be achieved. Since all the functionals involved in these methods depend on both memberships and codevectors, the optimization is iterative and follows the so-called Picard iteration method [6], where each iteration is composed of two steps. In the first step a subset of variables (the memberships) is kept fixed and the optimization is performed with respect to the remaining variables (the codevectors), while in the second step the roles of the fixed and moving variables are swapped. The optimization algorithm stops when the variables change less than a fixed threshold.

2.4.1. Fuzzy c-means

The fuzzy c-means algorithm [6] identifies clusters as fuzzy sets. It minimizes the functional

J(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \| x_h - v_i \|^2   (12)

with respect to the membership matrix U and the codebook V, under the probabilistic constraints

\sum_{i=1}^{c} u_{ih} = 1 \quad \forall h = 1, \ldots, n.   (13)

The parameter m controls the fuzziness of the memberships and is usually set to two; for high values of m the algorithm tends to make all the memberships equal, while for m tending to one we obtain the K-means algorithm, where the memberships are crisp. The minimization of Eq. (12) is carried out by introducing, for each pattern, a Lagrangian function containing the constraint of Eq. (13):

L_h = \sum_{i=1}^{c} (u_{ih})^m \| x_h - v_i \|^2 + \lambda_h \left( \sum_{i=1}^{c} u_{ih} - 1 \right).   (14)

The derivatives of the sum of the Lagrangians with respect to the u_{ih} and the v_i are then computed and set to zero. This yields the iteration scheme given by the two equations

u_{ih}^{-1} = \sum_{j=1}^{c} \left( \frac{\| x_h - v_i \|}{\| x_h - v_j \|} \right)^{2/(m-1)},   (15)

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m x_h}{\sum_{h=1}^{n} (u_{ih})^m}.   (16)

At each iteration it is possible to evaluate the amount of change of the memberships and codevectors, and the algorithm can be stopped when these quantities fall below a predefined threshold. At the end, a soft partitioning of the input space is obtained.
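A compact NumPy sketch of the resulting Picard iteration, Eqs. (15) and (16), is given below; the initialization, the tolerance and the small constant guarding against zero distances are our own illustrative choices.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means: Picard iteration of Eqs. (15) and (16)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                  # probabilistic constraint, Eq. (13)
    V = np.zeros((c, X.shape[1]))
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # codevector update, Eq. (16)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        # Membership update, Eq. (15): u_ih^{-1} = sum_j (d_ih / d_jh)^{2/(m-1)}
        ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U, V
```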
2.5. Possibilistic clustering methods

As a further modification of the K-means algorithm, the possibilistic approach [55,56] relaxes the probabilistic constraint on the memberships of a pattern to all clusters. In this way a pattern can have low membership to all clusters in the case of outliers, whereas, for instance, in the situation of overlapping clusters it can have high membership to more than one cluster. In this framework the membership represents a degree of typicality that does not depend on the membership values of the same pattern to the other clusters. Again the optimization procedure is the Picard iteration method, since the functional depends on both memberships and codevectors.

2.5.1. Possibilistic c-means

There are two formulations of the possibilistic c-means, which we will call PCM-I [55] and PCM-II [56]. The first one aims to minimize the following functional with respect to the membership matrix U and the codebook V = \{v_1, \ldots, v_c\}:

J(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \| x_h - v_i \|^2 + \sum_{i=1}^{c} \eta_i \sum_{h=1}^{n} (1 - u_{ih})^m,   (17)

while the second one addresses the functional

J(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} u_{ih} \| x_h - v_i \|^2 + \sum_{i=1}^{c} \eta_i \sum_{h=1}^{n} (u_{ih} \ln(u_{ih}) - u_{ih}).   (18)

The minimization of Eqs. (17) and (18) with respect to the u_{ih} leads, respectively, to the following equations:

u_{ih} = \left[ 1 + \left( \frac{\| x_h - v_i \|^2}{\eta_i} \right)^{1/(m-1)} \right]^{-1},   (19)

u_{ih} = \exp\left( -\frac{\| x_h - v_i \|^2}{\eta_i} \right).   (20)

The constraint u_{ih} \in [0, 1] on the memberships is automatically satisfied given the form assumed by Eqs. (19) and (20). The updates of the centroids for PCM-I and PCM-II are, respectively:

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m x_h}{\sum_{h=1}^{n} (u_{ih})^m},   (21)

v_i = \frac{\sum_{h=1}^{n} u_{ih} x_h}{\sum_{h=1}^{n} u_{ih}}.   (22)

The parameter \eta_i regulates the trade-off between the two terms in Eqs. (17) and (18) and is related to the width of the clusters. The authors suggest estimating \eta_i for PCM-I using the formula

\eta_i = \gamma \frac{\sum_{h=1}^{n} (u_{ih})^m \| x_h - v_i \|^2}{\sum_{h=1}^{n} (u_{ih})^m},   (23)

which is a weighted mean of the intracluster distances of the ith cluster; the constant \gamma is typically set to one. The parameter \eta_i can also be estimated with scale estimation techniques developed in the robust clustering literature for M-estimators [57,58]. The value of \eta_i can be updated at each step of the algorithm or can be kept fixed for all iterations. The former approach can lead to instabilities, since the derivation of the equations has been obtained considering \eta_i fixed. In the latter case a good estimation of \eta_i can be done only starting from an approximate solution. For this reason the possibilistic c-means is often run as a refining step of a fuzzy c-means.

3. Kernel clustering methods

In machine learning, the use of kernel functions [59] was introduced by Aizerman et al. [60] in 1964. In 1995 Cortes and Vapnik introduced support vector machines (SVMs) [61], which perform better than other classification algorithms on several problems. The success of SVMs has led to the extension of the use of kernels to other learning algorithms (e.g., kernel PCA [62]). The choice of the kernel is crucial to incorporate a priori knowledge on the application, for which it is possible to design ad hoc kernels.

3.1. Mercer kernels

We recall the definition of Mercer kernels [63,64], considering, for the sake of simplicity, vectors in \mathbb{R}^d instead of \mathbb{C}^d.

Definition 3.1. Let X = \{x_1, \ldots, x_n\} be a nonempty set where x_i \in \mathbb{R}^d. A function K : X \times X \to \mathbb{R} is called a positive definite kernel (or Mercer kernel) if and only if K is symmetric (i.e., K(x_i, x_j) = K(x_j, x_i)) and the following holds:

\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0 \quad \forall n \ge 2,   (24)

where c_r \in \mathbb{R}\ \forall r = 1, \ldots, n.

Each Mercer kernel can be expressed as

K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j),   (25)

where \Phi : X \to F performs a mapping from the input space X to a high dimensional feature space F. One of the most relevant aspects in applications is that it is possible to compute Euclidean distances in F without knowing \Phi explicitly. This can be done using the so-called distance kernel trick [62,65]:

\| \Phi(x_i) - \Phi(x_j) \|^2 = (\Phi(x_i) - \Phi(x_j)) \cdot (\Phi(x_i) - \Phi(x_j))
= \Phi(x_i) \cdot \Phi(x_i) + \Phi(x_j) \cdot \Phi(x_j) - 2\, \Phi(x_i) \cdot \Phi(x_j)
= K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j),   (26)

in which the computation of distances of vectors in feature space is just a function of the input vectors. In fact, every algorithm in which input vectors appear only in dot products with other input vectors can be kernelized [66]. In order to simplify the notation we introduce the so-called Gram matrix K, where each element k_{ij} is the scalar product \Phi(x_i) \cdot \Phi(x_j). Thus, Eq. (26) can be rewritten as

\| \Phi(x_i) - \Phi(x_j) \|^2 = k_{ii} + k_{jj} - 2 k_{ij}.   (27)

Examples of Mercer kernels are the following [67]:

• Linear:

K^{(l)}(x_i, x_j) = x_i \cdot x_j.   (28)

• Polynomial of degree p:

K^{(p)}(x_i, x_j) = (1 + x_i \cdot x_j)^p, \quad p \in \mathbb{N}.   (29)

• Gaussian:

K^{(g)}(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j \|^2}{2\sigma^2} \right), \quad \sigma \in \mathbb{R}.   (30)

It is important to stress that the use of the linear kernel in Eq. (26) simply leads to the computation of the Euclidean norm in the input space. Indeed,

\| x_i - x_j \|^2 = x_i \cdot x_i + x_j \cdot x_j - 2 x_i \cdot x_j
= K^{(l)}(x_i, x_i) + K^{(l)}(x_j, x_j) - 2 K^{(l)}(x_i, x_j)
= \| \Phi(x_i) - \Phi(x_j) \|^2   (31)

shows that choosing the kernel K^{(l)} implies \Phi = I (where I is the identity function). Following this consideration, we can think of kernels as offering a more general way to represent the elements of a set X; possibly, for some of these representations, the clusters can be identified more easily.

In the literature there are several applications of kernels to clustering. These methods can be broadly divided into three categories, based, respectively, on:

• kernelization of the metric [29,68,69];
• clustering in feature space [31,70–73];
• description via support vectors [24,74].

Methods based on kernelization of the metric look for centroids in the input space, and the distances between patterns and centroids are computed by means of kernels:

\| \Phi(x_h) - \Phi(v_i) \|^2 = K(x_h, x_h) + K(v_i, v_i) - 2 K(x_h, v_i).   (32)

Clustering in feature space is performed by mapping each pattern using the function \Phi and then computing centroids in feature space. Calling v_i^\Phi the centroids in feature space, we will see in the next sections that it is possible to compute the distances \| \Phi(x_h) - v_i^\Phi \|^2 by means of the kernel trick.

The description via support vectors makes use of One Class SVM to find a minimum enclosing sphere in feature space able to enclose almost all data, excluding outliers. The computed hypersphere corresponds to nonlinear surfaces in the input space enclosing groups of patterns. The support vector clustering algorithm then assigns the same label to the patterns enclosed by the same surface in input space. In the next subsections we will outline these three approaches.
3.2. Kernel K-means

Given the data set X, we map our data into some feature space F by means of a nonlinear map \Phi, and we consider k centers in feature space (v_i^\Phi \in F, i = 1, \ldots, k) [62,75]. We call the set V^\Phi = (v_1^\Phi, \ldots, v_k^\Phi) the feature space codebook, since in our representation the centers in feature space play the same role as the codevectors in the input space. In analogy with the codevectors in the input space, we define for each center v_i^\Phi its Voronoi region and Voronoi set in feature space. The Voronoi region in feature space R_i^\Phi of the center v_i^\Phi is the set of all vectors in F for which v_i^\Phi is the closest vector:

R_i^\Phi = \{ x^\Phi \in F \mid i = \arg\min_j \| x^\Phi - v_j^\Phi \| \}.   (33)

The Voronoi set in feature space \pi_i^\Phi of the center v_i^\Phi is the set of all vectors x in X such that v_i^\Phi is the closest vector to their images \Phi(x) in feature space:

\pi_i^\Phi = \{ x \in X \mid i = \arg\min_j \| \Phi(x) - v_j^\Phi \| \}.   (34)

The set of Voronoi regions in feature space defines a Voronoi tessellation of the feature space. The kernel K-means algorithm has the following steps:

(1) Project the data set X into a feature space F by means of a nonlinear mapping \Phi.
(2) Initialize the codebook V^\Phi = (v_1^\Phi, \ldots, v_k^\Phi) with v_i^\Phi \in F.
(3) Compute for each center v_i^\Phi the set \pi_i^\Phi.
(4) Update the codevectors v_i^\Phi in F:

v_i^\Phi = \frac{1}{|\pi_i^\Phi|} \sum_{x \in \pi_i^\Phi} \Phi(x).   (35)

(5) Go to step 3 until no v_i^\Phi changes.
(6) Return the feature space codebook.

This algorithm minimizes the quantization error in feature space.

Since we do not know \Phi explicitly, it is not possible to compute Eq. (35) directly. Nevertheless, it is always possible to compute distances between patterns and codevectors by using the kernel trick, which allows us to obtain the Voronoi sets in feature space \pi_i^\Phi. Indeed, writing each centroid in feature space as a combination of the mapped data vectors, we have

v_j^\Phi = \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h),   (36)

where \gamma_{jh} is one if x_h \in \pi_j^\Phi and zero otherwise. Now the quantity

\| \Phi(x_i) - v_j^\Phi \|^2 = \left\| \Phi(x_i) - \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) \right\|^2   (37)

can be expanded by using the scalar product and the kernel trick of Eq. (26):

\left\| \Phi(x_i) - \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) \right\|^2 = k_{ii} - 2 \sum_{h} \gamma_{jh} k_{ih} + \sum_{r} \sum_{s} \gamma_{jr} \gamma_{js} k_{rs}.   (38)

This allows us to compute the closest feature space codevector for each pattern and to update the coefficients \gamma_{jh}. It is possible to repeat these two operations until no \gamma_{jh} changes, obtaining a Voronoi tessellation of the feature space.

An on-line version of the kernel K-means algorithm can be found in Ref. [62]. A further version of K-means in feature space has been proposed by Girolami [75]. In his formulation the number of clusters is denoted by c and a fuzzy membership matrix U is introduced. Each element u_{ih} denotes the fuzzy membership of the point x_h to the Voronoi set \pi_i^\Phi. This algorithm tries to minimize the following functional with respect to U:

J^\Phi(U, V^\Phi) = \sum_{h=1}^{n} \sum_{i=1}^{c} u_{ih} \| \Phi(x_h) - v_i^\Phi \|^2.   (39)

The minimization technique used by Girolami is deterministic annealing [76], which is a stochastic method for optimization. A parameter controls the fuzziness of the memberships during the optimization and can be thought of as proportional to the temperature of a physical system. This parameter is gradually lowered during the annealing, and at the end of the procedure the memberships have become crisp; therefore a tessellation of the feature space is found. This linear partitioning in F, brought back to the input space, forms a nonlinear partitioning of the input space.
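As an illustration of the Gram-matrix formulation of Eqs. (36)–(38), the following NumPy sketch implements crisp kernel K-means. Note that here each feature-space codevector is taken as the mean of its Voronoi set, i.e., the coefficients \gamma_{jh} are set to 1/|\pi_j^\Phi| for the members of \pi_j^\Phi, so that Eq. (38) reproduces the distance to the centroid of Eq. (35); the function name and the random initialization are our own choices.

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Kernel K-means of Section 3.2 driven entirely by the Gram matrix K (n x n)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)                      # random initial Voronoi sets
    for _ in range(max_iter):
        d2 = np.zeros((n, k))
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                d2[:, j] = np.inf
                continue
            # Eq. (38) with gamma_jh = 1/|pi_j| for h in pi_j (mean codevector):
            # ||Phi(x_i) - v_j||^2 = k_ii - 2 mean_h k_ih + mean_{r,s} k_rs
            d2[:, j] = (np.diag(K)
                        - 2.0 * K[:, members].mean(axis=1)
                        + K[np.ix_(members, members)].mean())
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```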
3.3. Kernel SOM

The kernel version of the SOM algorithm [70,71] is based on the distance kernel trick. The method tries to adapt the grid of codevectors v_j^\Phi in feature space. In kernel SOM we start by writing each codevector as a combination of points in feature space:

v_j^\Phi = \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h),   (40)

where the coefficients \gamma_{jh} are initialized once the grid is created. Computing the winner by writing Eq. (6) in feature space leads to

s(\Phi(x_i)) = \arg\min_{v_j^\Phi \in V^\Phi} \| \Phi(x_i) - v_j^\Phi \|,   (41)

which can be written, using the kernel trick, as

s(\Phi(x_i)) = \arg\min_{v_j^\Phi \in V^\Phi} \left( k_{ii} - 2 \sum_{h} \gamma_{jh} k_{ih} + \sum_{r} \sum_{s} \gamma_{jr} \gamma_{js} k_{rs} \right).   (42)

To update the codevectors we rewrite Eq. (7):

v_j^{\Phi\,\prime} = v_j^\Phi + \epsilon(t)\, h(d_{rs})\, (\Phi(x) - v_j^\Phi).   (43)

Using Eq. (40):

\sum_{h=1}^{n} \gamma_{jh}' \Phi(x_h) = \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) + \epsilon(t)\, h(d_{rs}) \left( \Phi(x) - \sum_{h=1}^{n} \gamma_{jh} \Phi(x_h) \right).   (44)

Thus, when the presented pattern is x = x_i, the rule for the update of the \gamma_{jh} is

\gamma_{jh}' = (1 - \epsilon(t) h(d_{rs}))\, \gamma_{jh} \quad \text{if } h \ne i,
\gamma_{jh}' = (1 - \epsilon(t) h(d_{rs}))\, \gamma_{jh} + \epsilon(t) h(d_{rs}) \quad \text{otherwise}.   (45)

3.4. Kernel neural gas

The neural gas algorithm provides a soft update rule for the codevectors in the input space. The kernel version of neural gas [72] applies the same soft update rule to the codevectors in feature space. Rewriting Eq. (10) in feature space for the update of the codevectors, we have

\Delta v_j^\Phi = \epsilon(t)\, h_\lambda(\rho_j)\, (\Phi(x) - v_j^\Phi).   (46)

Here \rho_j is the rank of the distance \| \Phi(x) - v_j^\Phi \|. Again it is possible to write v_j^\Phi as a linear combination of the \Phi(x_i) as in Eq. (40), allowing the computation of such distances by means of the kernel trick. As in the kernel SOM technique, the updating rule for the centroids becomes an updating rule for the coefficients of this combination.

3.5. One class SVM

This approach provides a support vector description in feature space [74,77,78]. The idea is to use kernels to project data into a feature space and then to find the sphere enclosing almost all data, namely not including outliers. Formally, a radius R and the center v of the smallest enclosing sphere in feature space are defined. The constraint is

\| \Phi(x_j) - v \|^2 \le R^2 + \xi_j \quad \forall j,   (47)

where the non-negative slack variables \xi_j have been added. The Lagrangian for this problem is [79]

L = R^2 - \sum_{j} (R^2 + \xi_j - \| \Phi(x_j) - v \|^2)\, \beta_j - \sum_{j} \xi_j \mu_j + C \sum_{j} \xi_j,   (48)

where \beta_j \ge 0 and \mu_j \ge 0 are Lagrange multipliers, C is a constant and C \sum_j \xi_j is a penalty term. Computing the partial derivatives of L with respect to R, v and \xi_j and setting them to zero leads to the following equations:

\sum_{j} \beta_j = 1, \qquad v = \sum_{j} \beta_j \Phi(x_j), \qquad \beta_j = C - \mu_j.   (49)

The Karush–Kuhn–Tucker (KKT) complementarity conditions [79] result in

\xi_j \mu_j = 0, \qquad (R^2 + \xi_j - \| \Phi(x_j) - v \|^2)\, \beta_j = 0.   (50)

Following simple considerations regarding all these conditions, it is possible to see that:

• when \xi_j > 0, the image of x_j lies outside the hypersphere; these points are called bounded support vectors;
• when \xi_j = 0 and 0 < \beta_j < C, the image of x_j lies on the surface of the hypersphere; these points are called support vectors.

Moreover, it is possible to write the Wolfe dual form [77], whose optimization leads to the following quadratic programming problem with respect to the \beta_j:

J_W = \sum_{j} k_{jj} \beta_j - \sum_{i} \sum_{j} k_{ij} \beta_i \beta_j.   (51)

The distance between the image of a point x_j and the center v of the enclosing sphere can be computed as follows:

d_j = \| \Phi(x_j) - v \|^2 = k_{jj} - 2 \sum_{r} \beta_r k_{jr} + \sum_{r} \sum_{s} \beta_r \beta_s k_{rs}.   (52)

Fig. 3 shows the ability of this algorithm to find the smallest enclosing sphere without outliers.

Fig. 3. One class SVM applied to two data sets with outliers. The gray line shows the projection in input space of the smallest enclosing sphere in feature space. In (a) a linear kernel and in (b) a Gaussian kernel have been used.

3.5.1. Support vector clustering

Once the boundaries in input space are found, a labeling procedure is necessary in order to complete the clustering. In Ref. [74] the cluster assignment procedure follows a simple geometric idea: any path connecting a pair of points belonging to different clusters must exit from the enclosing sphere in feature space. Denoting by Y the image in feature space of one such path and by y the elements of Y, it will result that R(y) > R for some y. Thus it is possible to define an adjacency structure of the form

A_{ij} = \begin{cases} 1 & \text{if } R(y) < R \ \forall y \in Y, \\ 0 & \text{otherwise}. \end{cases}   (53)

Clusters are simply the connected components of the graph with the adjacency matrix just defined. In the implementation in Ref. [77] the check is made by sampling the line segment Y at 20 equidistant points. There are some modifications of this labeling algorithm (e.g., Refs. [80,81]) that improve performance. An improved version of the SVC algorithm with an application to handwritten digit recognition can be found in Ref. [82].
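The labeling step can be sketched as follows (assuming NumPy and SciPy are available). The dual coefficients \beta and the radius R are assumed to come from a previously solved one-class SVM (Eq. (51)) and are simply passed in; the kernel is a generic callable, e.g., a Gaussian kernel.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def svc_labels(X, beta, kernel, R, n_samples=20):
    """Cluster labeling step of support vector clustering (Section 3.5.1).

    X      : (n, d) input patterns,
    beta   : dual coefficients of the one-class SVM (Eq. (49)),
    kernel : callable kernel(A, B) returning the matrix of K(a_i, b_j) values,
    R      : radius of the enclosing sphere in feature space.

    Two points are adjacent (Eq. (53)) if every one of n_samples points on the
    segment joining them stays inside the sphere; clusters are the connected
    components of the resulting graph."""
    n = len(X)
    const = beta @ kernel(X, X) @ beta              # sum_{r,s} beta_r beta_s k_rs

    def dist2_to_center(Y):
        # ||Phi(y) - v||^2 as in Eq. (52), for arbitrary input points Y.
        return np.diag(kernel(Y, Y)) - 2.0 * kernel(Y, X) @ beta + const

    A = np.zeros((n, n), dtype=int)
    ts = np.linspace(0.0, 1.0, n_samples)
    for i in range(n):
        for j in range(i + 1, n):
            seg = X[i] + ts[:, None] * (X[j] - X[i])        # sampled segment
            if np.all(dist2_to_center(seg) <= R ** 2):
                A[i, j] = A[j, i] = 1
    return connected_components(csr_matrix(A))[1]

# Example Gaussian kernel to pass as `kernel`:
# kernel = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * 0.5 ** 2))
```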
3.5.2. Camastra and Verri algorithm

A technique combining K-means and One Class SVM can be found in Ref. [24]. The algorithm uses a K-means-like strategy, i.e., it repeatedly moves all centers v_i^\Phi in feature space, computing a One Class SVM on their Voronoi sets \pi_i^\Phi, until no center changes anymore. Moreover, in order to introduce robustness against outliers, the authors propose to compute the One Class SVM on the set \pi_i^\Phi(\rho) of each center v_i^\Phi, defined as

\pi_i^\Phi(\rho) = \{ x_j \in \pi_i^\Phi \ \text{and} \ \| \Phi(x_j) - v_i^\Phi \| < \rho \}.   (54)

\pi_i^\Phi(\rho) is the Voronoi set in feature space of the center v_i^\Phi without outliers, that is, without the images of data points whose distance from the center is larger than \rho. The parameter \rho can be set using model selection techniques [48] (e.g., cross-validation). In summary, the algorithm has the following steps:

(1) Project the data set X into a feature space F by means of a nonlinear mapping \Phi.
(2) Initialize the codebook V^\Phi = (v_1^\Phi, \ldots, v_k^\Phi) with v_i^\Phi \in F.
(3) Compute \pi_i^\Phi(\rho) for each center v_i^\Phi.
(4) Apply One Class SVM to each \pi_i^\Phi(\rho) and assign the obtained center to v_i^\Phi.
(5) Go to step 3 until no v_i^\Phi changes.
(6) Return the feature space codebook.

3.6. Kernel fuzzy clustering methods

Here we show some kernelized versions of the fuzzy c-means algorithms, considering in particular fuzzy and possibilistic c-means. In the first subsection we show the method based on the kernelization of the metric, while in the second the fuzzy c-means in feature space is shown. The third subsection is devoted to the kernelized version of the possibilistic c-means.

3.6.1. Kernel fuzzy c-means with kernelization of the metric

The basic idea is to minimize the functional [29,68,69]

J^\Phi(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \| \Phi(x_h) - \Phi(v_i) \|^2   (55)

with the probabilistic constraint over the memberships (Eq. (13)). The procedure for the optimization of J^\Phi(U, V) is again the Picard iteration technique. Minimization of the functional in Eq. (55) has been proposed only in the case of a Gaussian kernel K^{(g)}. The reason is that the derivative of J^\Phi(U, V) with respect to the v_i is particularly simple for a Gaussian kernel, since it allows the use of the kernel trick:

\frac{\partial K(x_h, v_i)}{\partial v_i} = \frac{(x_h - v_i)}{\sigma^2}\, K(x_h, v_i).   (56)

We obtain for the memberships

u_{ih}^{-1} = \sum_{j=1}^{c} \left( \frac{1 - K(x_h, v_i)}{1 - K(x_h, v_j)} \right)^{1/(m-1)},   (57)

and for the codevectors

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)\, x_h}{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)}.   (58)

3.6.2. Kernel fuzzy c-means in feature space

Here we derive the fuzzy c-means in feature space, a clustering method that finds a soft linear partitioning of the feature space. This partitioning, brought back to the input space, results in a soft nonlinear partitioning of the data. The functional to optimize [31,73], with the probabilistic constraint in Eq. (13), is

J^\Phi(U, V^\Phi) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \| \Phi(x_h) - v_i^\Phi \|^2.   (59)

It is possible to rewrite the norm in Eq. (59) explicitly by using

v_i^\Phi = \frac{\sum_{h=1}^{n} (u_{ih})^m \Phi(x_h)}{\sum_{h=1}^{n} (u_{ih})^m} = a_i \sum_{h=1}^{n} (u_{ih})^m \Phi(x_h),   (60)

which is the kernel version of Eq. (16). For simplicity of notation we used

a_i^{-1} = \sum_{r=1}^{n} (u_{ir})^m.   (61)

Now it is possible to write the kernel version of Eq. (15):

u_{ih}^{-1} = \sum_{j=1}^{c} \left( \frac{ k_{hh} - 2 a_i \sum_{r=1}^{n} (u_{ir})^m k_{hr} + a_i^2 \sum_{r=1}^{n} \sum_{s=1}^{n} (u_{ir})^m (u_{is})^m k_{rs} }{ k_{hh} - 2 a_j \sum_{r=1}^{n} (u_{jr})^m k_{hr} + a_j^2 \sum_{r=1}^{n} \sum_{s=1}^{n} (u_{jr})^m (u_{js})^m k_{rs} } \right)^{1/(m-1)}.   (62)

Eq. (62) gives the rule for the update of the memberships u_{ih}.
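A minimal NumPy sketch of this feature-space fuzzy c-means, driven only by the Gram matrix through Eqs. (60)–(62), is given below; the initialization and the stopping rule are our own illustrative choices.

```python
import numpy as np

def kernel_fcm_feature_space(K, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means in feature space (Section 3.6.2): memberships are updated
    with Eq. (62), using only the Gram matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)
    kdiag = np.diag(K)
    for _ in range(max_iter):
        Um = U ** m                                    # (c, n)
        a = 1.0 / Um.sum(axis=1)                       # Eq. (61)
        # Squared feature-space distances ||Phi(x_h) - v_i||^2, the terms of Eq. (62):
        d2 = (kdiag[None, :]
              - 2.0 * a[:, None] * (Um @ K)
              + (a ** 2 * np.einsum('ir,rs,is->i', Um, K, Um))[:, None])
        d2 = np.maximum(d2, 1e-12)
        ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)                # Eq. (62)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U
```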
3.6.3. Possibilistic c-means with kernelization of the metric

The formulation of the possibilistic c-means PCM-I with the kernelization of the metric used in Ref. [69] involves the minimization of the functional

J^\Phi(U, V) = \sum_{h=1}^{n} \sum_{i=1}^{c} (u_{ih})^m \| \Phi(x_h) - \Phi(v_i) \|^2 + \sum_{i=1}^{c} \eta_i \sum_{h=1}^{n} (1 - u_{ih})^m.   (63)

Minimization leads to

u_{ih}^{-1} = 1 + \left( \frac{\| \Phi(x_h) - \Phi(v_i) \|^2}{\eta_i} \right)^{1/(m-1)},   (64)

which can be rewritten, considering a Gaussian kernel, as

u_{ih}^{-1} = 1 + \left( \frac{2\,(1 - K(x_h, v_i))}{\eta_i} \right)^{1/(m-1)}.   (65)

The update of the codevectors follows:

v_i = \frac{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)\, x_h}{\sum_{h=1}^{n} (u_{ih})^m K(x_h, v_i)}.   (66)

The computation of the \eta_i is straightforward.

4. Spectral clustering

Spectral clustering methods [39] have a strong connection with graph theory [7,83]. A comparison of some spectral clustering methods has been recently proposed in Ref. [12]. Let X = \{x_1, \ldots, x_n\} be the set of patterns to cluster. Starting from X, we can build a complete, weighted undirected graph G(V, A) having a set of nodes V = \{v_1, \ldots, v_n\} corresponding to the n patterns, and edges defined through the n \times n adjacency (also called affinity) matrix A. The adjacency matrix of a weighted graph is the matrix whose element a_{ij} represents the weight of the edge connecting nodes i and j. Since the graph is undirected, the property a_{ij} = a_{ji} holds. Adjacency between two patterns can be defined as follows:

a_{ij} = \begin{cases} h(x_i, x_j) & \text{if } i \ne j, \\ 0 & \text{otherwise}. \end{cases}   (67)

The function h measures the similarity between patterns, and typically a Gaussian function is used:

h(x_i, x_j) = \exp\left( -\frac{d(x_i, x_j)}{2\sigma^2} \right),   (68)

where d measures the dissimilarity between patterns and \sigma controls the rapidity of decay of h. This particular choice has the property that A has only some terms significantly different from 0, i.e., it is sparse.

The degree matrix D is the diagonal matrix whose elements are the degrees of the nodes of G:

d_{ii} = \sum_{j=1}^{n} a_{ij}.   (69)

In this framework the clustering problem can be seen as a graph cut problem [7] where one wants to separate a set of nodes S \subset V from the complementary set \bar{S} = V \setminus S. The graph cut problem can be formulated in several ways, depending on the choice of the objective function. One of the most popular objective functions is the cut [7]:

\mathrm{cut}(S, \bar{S}) = \sum_{v_i \in S,\, v_j \in \bar{S}} a_{ij}.   (70)

It is easy to verify that the minimization of this objective function favors partitions containing isolated nodes. To achieve a better balance in the cardinality of S and \bar{S}, it has been suggested to optimize the normalized cut [37]:

N\mathrm{cut}(S, \bar{S}) = \mathrm{cut}(S, \bar{S}) \left( \frac{1}{\mathrm{assoc}(S, V)} + \frac{1}{\mathrm{assoc}(\bar{S}, V)} \right),   (71)

where the association \mathrm{assoc}(S, V), also known as the volume of S, is

\mathrm{assoc}(S, V) = \sum_{v_i \in S,\, v_j \in V} a_{ij} \equiv \mathrm{vol}(S) = \sum_{v_i \in S} d_{ii}.   (72)

There are other definitions of objective functions (e.g., the conductance [13], the normalized association [37], the ratio cut [10]).

The complexity of optimizing these objective functions is very high (e.g., the optimization of the normalized cut is an NP-hard problem [37,84]), and for this reason it has been proposed to relax them by using spectral concepts of graph analysis. This relaxation can be formulated by introducing the Laplacian matrix [7]

L = D - A,   (73)

which can be seen as a linear operator on G. In addition to this definition of the Laplacian, there are alternative definitions:

• Normalized Laplacian L_N = D^{-1/2} L D^{-1/2}.
• Generalized Laplacian L_G = D^{-1} L.
• Relaxed Laplacian L_\rho = L - \rho D.

Each definition is justified by special properties desirable in a given context. The spectral decomposition of the Laplacian matrix can give useful information about the properties of the graph. In particular, it can be seen that the second smallest eigenvalue of L is related to the graph cut [8], and the corresponding eigenvector can cluster together similar patterns [7,37,85].

The spectral approach to clustering has a strong connection with Laplacian eigenmaps [86]. The dimensionality reduction problem aims to find a proper low dimensional representation of a data set lying in a high dimensional space. In Ref. [86], each node in the graph, which represents a pattern, is connected only to the nodes corresponding to neighboring patterns, and the spectral decomposition of the Laplacian of the resulting graph permits finding a low dimensional representation of X. The authors point out the close connection with spectral clustering and locally linear embedding [87], providing theoretical and experimental validations.
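As an illustration of Eqs. (67)–(69) and (73), the following NumPy sketch builds the affinity matrix, the degree matrix and the (normalized) Laplacian for a generic data set; taking the dissimilarity d of Eq. (68) to be the squared Euclidean distance is an assumption of ours.

```python
import numpy as np

def affinity_and_laplacians(X, sigma=1.0):
    """Build the adjacency matrix of Eqs. (67)-(68), the degree matrix of
    Eq. (69) and the Laplacians L and L_N of Eq. (73) for a data set X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                  # a_ii = 0 as in Eq. (67)
    D = np.diag(A.sum(axis=1))                # degree matrix, Eq. (69)
    L = D - A                                 # Laplacian, Eq. (73)
    Dmh = np.diag(1.0 / np.sqrt(np.diag(D)))
    L_norm = Dmh @ L @ Dmh                    # normalized Laplacian L_N
    return A, D, L, L_norm
```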
4.1. Shi and Malik algorithm

The algorithm proposed by Shi and Malik [37] applies the concepts of spectral clustering to image segmentation problems. In this framework each node is a pixel and the definition of adjacency between nodes is tailored to image segmentation purposes. In particular, if x_i is the position of the ith pixel and f_i a feature vector which takes into account several of its attributes (e.g., intensity, color and texture information), they define the adjacency as

a_{ij} = \exp\left( -\frac{\| f_i - f_j \|^2}{2\sigma_1^2} \right) \times \begin{cases} \exp\left( -\dfrac{\| x_i - x_j \|^2}{2\sigma_2^2} \right) & \text{if } \| x_i - x_j \| < R, \\ 0 & \text{otherwise}. \end{cases}   (74)

Here R influences how many neighboring pixels can be connected to a pixel, controlling the sparsity of the adjacency and Laplacian matrices. The authors prove that the minimization of N\mathrm{cut}(S, \bar{S}) can be performed by solving the eigenvalue problem for the normalized Laplacian L_N. In summary, the algorithm is composed of the following steps:

(1) Construct the graph G starting from the data set X, calculating the adjacency between patterns using Eq. (74).
(2) Compute the degree matrix D.
(3) Construct the matrix L_N = D^{-1/2} L D^{-1/2}.
(4) Compute the eigenvector e_2 associated with the second smallest eigenvalue \lambda_2.
(5) Use D^{-1/2} e_2 to segment G.

In the ideal case of two disconnected subgraphs, D^{-1/2} e_2 assumes just two values; this allows clustering together the components of D^{-1/2} e_2 with the same value. In a real case, a splitting point must be chosen to cluster the components of D^{-1/2} e_2, and the authors suggest using the median value, zero, or the value for which the clustering gives the minimum N\mathrm{cut}. The successive partitioning can be made recursively on the obtained sub-graphs, or it is possible to use more than one eigenvector. An interesting approach for clustering the data set into more than two clusters simultaneously can be found in Ref. [88].
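A generic two-way version of this procedure can be sketched as follows (assuming NumPy); for simplicity the sketch starts from an arbitrary affinity matrix rather than the image-specific adjacency of Eq. (74), and it thresholds D^{-1/2} e_2 at its median (or at zero) instead of searching for the minimum-Ncut split.

```python
import numpy as np

def shi_malik_bipartition(A, split='median'):
    """Two-way cut in the spirit of Shi and Malik: take the eigenvector of
    L_N = D^{-1/2} L D^{-1/2} associated with the second smallest eigenvalue
    and threshold D^{-1/2} e_2."""
    d = A.sum(axis=1)
    Dmh = np.diag(1.0 / np.sqrt(d))
    L_norm = Dmh @ (np.diag(d) - A) @ Dmh
    w, V = np.linalg.eigh(L_norm)              # eigenvalues in ascending order
    e2 = Dmh @ V[:, 1]                         # D^{-1/2} e_2, step (5)
    thr = np.median(e2) if split == 'median' else 0.0
    return (e2 > thr).astype(int)              # binary cluster indicator
```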
4.2. Ng, Jordan and Weiss algorithm

The algorithm proposed by Ng et al. [35] uses the adjacency matrix A as the Laplacian. This definition allows one to consider the eigenvectors associated with the largest eigenvalues as the "good" ones for clustering. This has a computational advantage, since the principal eigenvectors of sparse matrices can be computed efficiently using the power iteration technique. The idea is the same as in other spectral clustering methods, i.e., one finds a new representation of the patterns on the first k eigenvectors of the Laplacian of the graph.

The algorithm is composed of the following steps:

(1) Compute the affinity matrix A \in \mathbb{R}^{n \times n}:

a_{ij} = \begin{cases} \exp\left( -\dfrac{\| x_i - x_j \|^2}{2\sigma^2} \right) & \text{if } i \ne j, \\ 0 & \text{otherwise}. \end{cases}   (75)

(2) Construct the degree matrix D.
(3) Compute a normalized version of A, defining the Laplacian

L = D^{-1/2} A D^{-1/2}.   (76)

(4) Find the k eigenvectors \{e_1, \ldots, e_k\} of L associated with the k largest eigenvalues \{\lambda_1, \ldots, \lambda_k\}.
(5) Form the matrix Z by stacking the k eigenvectors in columns.
(6) Compute the matrix Y by normalizing each row of Z to unit length:

y_{ij} = \frac{z_{ij}}{\left( \sum_{r=1}^{k} z_{ir}^2 \right)^{1/2}}.   (77)

In this way all the original points are mapped onto the unit hypersphere.
(7) In this new representation of the original n points, apply a clustering algorithm that attempts to minimize distortion, such as K-means.

As a criterion for choosing \sigma, the authors suggest using the value that guarantees the minimum distortion when the clustering stage is performed on Y. They tested this algorithm on artificial data sets, showing its capability to separate nonlinear structures. Here we show the steps of the algorithm when applied to the data set in Fig. 1. Once the singular value decomposition of L is computed, we can see the matrices Z and Y in Fig. 4 (here obtained with \sigma = 0.4). Once Y is computed, it is easy to cluster the two groups of points, obtaining the result shown in Fig. 5.

Fig. 4. (a) The matrix Z obtained with the first two eigenvectors of the matrix L. (b) The matrix Y obtained by normalizing the rows of Z, clustered by the K-means algorithm with two centroids.

Fig. 5. The result of the Ng and Jordan algorithm on the ring data set.
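A NumPy sketch of the whole procedure is reported below; the final distortion-minimizing step reuses a plain batch K-means on the rows of Y, and \sigma is left as a free parameter (0.4 mirrors the value used for Fig. 4, but this is only an illustrative choice of ours).

```python
import numpy as np

def njw_spectral_clustering(X, k, sigma=0.4, seed=0):
    """Ng, Jordan and Weiss algorithm (Section 4.2): affinity of Eq. (75),
    normalized matrix of Eq. (76), row-normalized eigenvector matrix Y of
    Eq. (77), followed by K-means on the rows of Y."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    Dmh = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L = Dmh @ A @ Dmh                                   # Eq. (76)
    w, V = np.linalg.eigh(L)
    Z = V[:, -k:]                                       # eigenvectors of the k largest eigenvalues
    Y = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # Eq. (77)
    # Step (7): cluster the rows of Y with a plain batch K-means.
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), size=k, replace=False)]
    labels = np.zeros(len(Y), dtype=int)
    for _ in range(100):
        labels = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.vstack([Y[labels == i].mean(axis=0) if np.any(labels == i) else C[i]
                       for i in range(k)])
    return labels
```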
4.3. Other methods

An interesting view of spectral clustering is provided by Meilă et al. [23], who describe it in the framework of Markov random walks [23], leading to a different interpretation of the graph cut problem. It is known from the theory of Markov random walks that if we construct the stochastic matrix P = D^{-1} A, each element p_{ij} represents the probability of moving from node i to node j. In their work they provide an explicit connection between the spectral decompositions of L and P, showing that both have the same solutions, with the eigenvalues of P equal to 1 - \lambda_i, where the \lambda_i are the eigenvalues of L. Moreover, they propose a method to learn a function of the features able to produce a correct segmentation starting from a segmented image.

An interesting study on spectral clustering has been conducted by Kannan et al. [13]. The authors analyze several objective functions on some artificial data sets, showing that there is no objective function able to properly cluster every data set; in other words, there always exists some data set for which the optimization of a particular objective function has some drawback. For this reason they propose a bi-criterion objective function. The two objectives are based, respectively, on the conductance and on the ratio between the auto-association of a subset of nodes S and its volume. Again the relaxation of this problem is achieved through the decomposition of the Laplacian of the graph associated with the data set.

5. A unified view of spectral and kernel clustering methods

Recently, a possible connection between unsupervised kernel algorithms and spectral methods has been studied, to find out whether these two seemingly different approaches can be described under a more general framework. The hint for this unifying theory lies in the adjacency structure constructed by both approaches. In the spectral approach there is an adjacency between patterns which is the analogue of the kernel function in kernel methods.

A direct connection between kernel PCA and spectral methods has been shown [89,90]. More recently, a unifying view of kernel K-means and spectral clustering methods [9–11] has been pointed out. In this section we show explicitly the equivalence between them, highlighting that these two approaches have the same foundation and, in particular, that both can be viewed as a matrix trace maximization problem.

5.1. Kernel clustering methods objective

To show the direct equivalence between kernel and spectral clustering methods we introduce the weighted version of kernel K-means [11]. We introduce a weight matrix W having the weights w_k on its diagonal. Recalling that we denote by \pi_i the ith cluster, the functional to minimize is

J^\Phi(W, V^\Phi) = \sum_{i=1}^{c} \sum_{x_k \in \pi_i} w_k \| \Phi(x_k) - v_i^\Phi \|^2,   (78)

where

v_i^\Phi = \frac{\sum_{x_k \in \pi_i} w_k \Phi(x_k)}{\sum_{x_k \in \pi_i} w_k} = \frac{\sum_{x_k \in \pi_i} w_k \Phi(x_k)}{s_i},   (79)

and we have introduced

s_i = \sum_{x_k \in \pi_i} w_k.   (80)

Now let us define the matrix Z with elements

z_{ki} = \begin{cases} s_i^{-1/2} & \text{if } x_k \in \pi_i, \\ 0 & \text{otherwise}. \end{cases}   (81)

Since the columns of Z are mutually orthogonal, it is easy to verify that

s_i^{-1} = (Z^T Z)_{ii},   (82)

and that only the diagonal elements of Z^T Z are non-null.

Now we denote by F the matrix whose columns are the \Phi(x_k). It is easy to verify that F W is the matrix whose columns are the w_k \Phi(x_k). Moreover, the expression F W Z Z^T gives a matrix whose n columns are the nearest centroids in feature space of the \Phi(x_k).

Thus, substituting Eq. (79) into Eq. (78), we obtain the following matrix expression for J^\Phi(W, V^\Phi):

J^\Phi(W, V^\Phi) = \sum_{k=1}^{n} w_k \| F_{\cdot k} - (F W Z Z^T)_{\cdot k} \|^2.   (83)

Here the dot has to be understood as the selection of the kth column of the matrices. Introducing the matrix Y = W^{1/2} Z, which is orthonormal (Y^T Y = I), the objective function can be rewritten as

J^\Phi(W, V^\Phi) = \sum_{k=1}^{n} w_k \| F_{\cdot k} - (F W^{1/2} Y Y^T W^{-1/2})_{\cdot k} \|^2 = \| F W^{1/2} - F W^{1/2} Y Y^T \|_F^2,   (84)

where \| \cdot \|_F is the Frobenius norm [91]. Using the fact that \| A \|_F^2 = \mathrm{tr}(A A^T) and the properties of the trace, it is possible to see that the minimization of the last equation is equivalent to the maximization of the following [9,10]:

J^\Phi(W, V^\Phi) = \mathrm{tr}(Y^T W^{1/2} F^T F W^{1/2} Y).   (85)

5.2. Spectral clustering methods objective

Recalling that the association between two sets of nodes S and T of a weighted graph is defined as

\mathrm{assoc}(S, T) = \sum_{i \in S,\, j \in T} a_{ij},   (86)

it is possible to define many objective functions whose optimization performs clustering. Here, for the sake of simplicity, we consider just the ratio association problem, where one has to maximize

J(S_1, \ldots, S_c) = \sum_{i=1}^{c} \frac{\mathrm{assoc}(S_i, S_i)}{|S_i|},   (87)

where |S_i| is the size of the ith partition. Now we introduce the indicator vector z_i, whose kth component is zero if x_k \notin \pi_i and one otherwise. Rewriting the last equation in matrix form, we obtain

J(S_1, \ldots, S_c) = \sum_{i=1}^{c} \frac{z_i^T A z_i}{z_i^T z_i}.   (88)

Normalizing the z_i by letting

y_i = \frac{z_i}{(z_i^T z_i)^{1/2}},   (89)

we obtain

J(S_1, \ldots, S_c) = \sum_{i=1}^{c} y_i^T A y_i = \mathrm{tr}(Y^T A Y).   (90)

5.3. A unified view of the two approaches

Comparing Eqs. (90) and (85), it is possible to see the perfect equivalence between kernel K-means and the spectral approach to clustering when one wants to maximize the ratio association. To this end, indeed, it is enough to set the weights in the weighted kernel K-means equal to one, obtaining the classical kernel K-means. It is possible to obtain more general results when one wants to optimize other objective functions in the spectral approach, such as the ratio cut [92], the normalized cut and the Kernighan–Lin [93] objective. For instance, in the case of the minimization of the normalized cut, which is one of the most used objective functions, the corresponding trace optimization problem is

J(S_1, \ldots, S_c) = \mathrm{tr}(Y^T D^{-1/2} A D^{-1/2} Y).   (91)

Thus the correspondence with the objective of the weighted kernel K-means requires choosing Y = D^{1/2} Z, W = D and K = D^{-1} A D^{-1}. It is worth noting that for an arbitrary A it is not guaranteed that D^{-1} A D^{-1} is positive definite; in this case the kernel K-means will not necessarily converge. To cope with this problem, in Ref. [9] the authors propose to enforce positive definiteness by means of a diagonal shift [94]:

K = \sigma D^{-1} + D^{-1} A D^{-1},   (92)

where \sigma is a positive coefficient large enough to guarantee the positive definiteness of K. Since the mathematical foundation of these methods is the same, it is possible to choose which algorithm to use for clustering by selecting, for instance, the approach with the lower computational complexity for the particular application.
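The correspondence of this section can be made operational with the following NumPy sketch, which runs weighted kernel K-means (Eq. (78)) with weights W = D and the shifted kernel of Eq. (92), thereby optimizing the normalized cut; the diagonal shift value and function names are our own illustrative choices.

```python
import numpy as np

def weighted_kernel_kmeans(K, w, k, max_iter=100, seed=0):
    """Weighted kernel K-means (Eq. (78)) on a kernel matrix K with weights w."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    for _ in range(max_iter):
        d2 = np.zeros((n, k))
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if idx.size == 0:
                d2[:, j] = np.inf
                continue
            wj = w[idx]
            sj = wj.sum()                                           # Eq. (80)
            # ||Phi(x_i) - v_j||^2 with v_j the weighted mean of Eq. (79):
            d2[:, j] = (np.diag(K) - 2.0 * (K[:, idx] @ wj) / sj
                        + (wj @ K[np.ix_(idx, idx)] @ wj) / sj ** 2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def normalized_cut_via_kernel_kmeans(A, k, shift=1.0):
    """Normalized-cut clustering through the equivalence of Section 5.3:
    weights W = D and kernel K = shift * D^{-1} + D^{-1} A D^{-1} (Eq. (92))."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / d)
    K = shift * Dinv + Dinv @ A @ Dinv
    return weighted_kernel_kmeans(K, d, k)
```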
M. Filippone et al. / Pattern Recognition 41 (2008) 176 – 190 189

[7] F.R.K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, Providence, RI, February 1997.
[8] M. Fiedler, Algebraic connectivity of graphs, Czechoslovak Math. J. 23 (98) (1973) 298–305.
[9] I.S. Dhillon, Y. Guan, B. Kulis, Weighted graph cuts without eigenvectors: a multilevel approach, IEEE Trans. Pattern Anal. Mach. Intell. (2007), to appear.
[10] I.S. Dhillon, Y. Guan, B. Kulis, A unified view of kernel k-means, spectral clustering and graph partitioning, Technical Report TR-04-25, UTCS, 2005.
[11] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clustering and normalized cuts, in: KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 2004, pp. 551–556.
[12] D. Verma, M. Meila, A comparison of spectral clustering algorithms, Technical Report, Department of CSE, University of Washington, Seattle, WA, 98195-2350, 2005.
[13] R. Kannan, S. Vempala, A. Vetta, On clusterings: good, bad, and spectral, in: Proceedings of the 41st Annual Symposium on the Foundation of Computer Science, IEEE Computer Society, Silver Spring, MD, November 2000, pp. 367–380.
[14] U. von Luxburg, M. Belkin, O. Bousquet, Consistency of spectral clustering, Technical Report 134, Max Planck Institute for Biological Cybernetics, 2004.
[15] U. von Luxburg, O. Bousquet, M. Belkin, Limits of spectral clustering, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems (NIPS), vol. 17, MIT Press, Cambridge, MA, 2005.
[16] H. Zha, X. He, C.H.Q. Ding, M. Gu, H.D. Simon, Spectral relaxation for k-means clustering, in: NIPS, 2001, pp. 1057–1064.
[17] J. Leski, Fuzzy c-varieties/elliptotypes clustering in reproducing kernel Hilbert space, Fuzzy Sets and Systems 141 (2) (2004) 259–280.
[18] A.S. Have, M.A. Girolami, J. Larsen, Clustering via kernel decomposition, IEEE Trans. Neural Networks 17 (1) (2006) 256–264.
[19] D. Horn, Clustering via Hilbert space, Phys. A Stat. Mech. Appl. 302 (2001) 70–79.
[20] F.R. Bach, M.I. Jordan, Learning spectral clustering, Technical Report UCB/CSD-03-1249, EECS Department, University of California, Berkeley, 2003.
[21] N. Cristianini, J.S. Taylor, A. Elisseeff, J.S. Kandola, On kernel-target alignment, in: NIPS, 2001, pp. 367–373.
[22] I. Fischer, I. Poland, New methods for spectral clustering, Technical Report IDSIA-12-04, IDSIA, 2004.
[23] M. Meila, J. Shi, Learning segmentation by random walks, in: NIPS, 2000, pp. 873–879.
[24] F. Camastra, A. Verri, A novel kernel method for clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 801–804.
[25] V.G. Sigillito, S.P. Wing, L.V. Hutton, K.B. Baker, Classification of radar returns from the ionosphere using neural networks, Johns Hopkins APL Tech. Digest 10 (1989) 262–266.
[26] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proc. Natl. Acad. Sci. USA 87 (1990) 9193–9196.
[27] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188.
[28] S.-C. Chen, D.-Q. Zhang, Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure, IEEE Trans. Systems Man Cybernet. B 34 (4) (2004) 1907–1916.
[29] D.-Q. Zhang, S.-C. Chen, A novel kernelized fuzzy c-means algorithm with application in medical image segmentation, Artif. Intell. Med. 32 (1) (2004) 37–50.
[30] D.-Q. Zhang, S.-C. Chen, Z.-S. Pan, K.-R. Tan, Kernel-based fuzzy clustering incorporating spatial constraints for image segmentation, in: International Conference on Machine Learning and Cybernetics, vol. 4, 2003, pp. 2189–2192.
[31] T. Graepel, K. Obermayer, Fuzzy topographic kernel clustering, in: W. Brauer (Ed.), Proceedings of the Fifth GI Workshop Fuzzy Neuro Systems '98, 1998, pp. 90–97.
[32] X. Tan, S. Chen, Z.H. Zhou, F. Zhang, Robust face recognition from a single training image per person with kernel-based SOM-face, in: ISNN (1), 2004, pp. 858–863.
[33] D.S. Satish, C.C. Sekhar, Kernel based clustering and vector quantization for speech recognition, in: Proceedings of the 2004 14th IEEE Signal Processing Society Workshop, 2004, pp. 315–324.
[34] A. Majid Awan, Mohd Noor Md. Sap, An intelligent system based on kernel methods for crop yield prediction, in: W.K. Ng, M. Kitsuregawa, J. Li, K. Chang (Eds.), PAKDD, Lecture Notes in Computer Science, vol. 3918, 2006, pp. 841–846.
[35] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2002.
[36] A. Rahimi, B. Recht, Clustering with normalized cuts is clustering with a hyperplane, Statist. Learn. Comput. Vis. (2004).
[37] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.
[38] A.N. Srivastava, Mixture density Mercer kernels: a method to learn kernels directly from data, in: SDM, 2004.
[39] N. Cristianini, J.S. Taylor, J.S. Kandola, Spectral kernel methods for clustering, in: NIPS, 2001, pp. 649–655.
[40] I.S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, in: KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 2001, pp. 269–274.
[41] Y. Kluger, R. Basri, J.T. Chang, M. Gerstein, Spectral biclustering of microarray data: coclustering genes and conditions, Genome Res. 13 (4) (2003) 703–716.
[42] B. Kulis, S. Basu, I.S. Dhillon, R. Mooney, Semi-supervised graph clustering: a kernel approach, in: ICML '05: Proceedings of the 22nd International Conference on Machine Learning, ACM Press, New York, NY, USA, 2005, pp. 457–464.
[43] A. Paccanaro, C. Chennubhotla, J.A. Casbon, M.A.S. Saqi, Spectral clustering of protein sequences, in: International Joint Conference on Neural Networks, vol. 4, 2003, pp. 3083–3088.
[44] J. Weston, C. Leslie, E. Ie, D. Zhou, A. Elisseeff, W.S. Noble, Semi-supervised protein classification using cluster kernels, Bioinformatics 21 (15) (2005) 3241–3247.
[45] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design, IEEE Trans. Commun. 1 (1980) 84–95.
[46] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inf. Theory 28 (1982) 129–137.
[47] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression, Kluwer, Boston, 1992.
[48] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1996.
[49] J.B. Macqueen, Some methods of classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[50] T. Kohonen, The self-organizing map, in: Proceedings of the IEEE, vol. 78, 1990, pp. 1464–1480.
[51] T. Kohonen, Self-organized formation of topologically correct feature maps, Biol. Cybernet. 43 (1) (1982) 59–69.
[52] T. Kohonen, Self-Organizing Maps, Springer, New York, Inc., Secaucus, NJ, USA, 2001.
[53] H.J. Ritter, T.M. Martinetz, K.J. Schulten, Neuronale Netze, Addison-Wesley, München, Germany, 1991.
[54] T.M. Martinetz, S.G. Berkovich, K.J. Schulten, 'Neural gas' network for vector quantization and its application to time-series prediction, IEEE Trans. Neural Networks 4 (4) (1993) 558–569.
[55] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Systems 1 (2) (1993) 98–110.
[56] R. Krishnapuram, J.M. Keller, The possibilistic c-means algorithm: insights and recommendations, IEEE Trans. Fuzzy Systems 4 (3) (1996) 385–393.
[57] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[58] O. Nasraoui, R. Krishnapuram, An improved possibilistic c-means algorithm with finite rejection and robust scale estimation, in: North American Fuzzy Information Processing Society Conference, Berkeley, CA, June 1996.
[59] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Proc. R. Soc. London 209 (1909) 415–446.
[60] M. Aizerman, E. Braverman, L. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Remote Control 25 (1964) 821–837.
[61] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273–297.
[62] B. Schölkopf, A.J. Smola, K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319.
[63] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68 (3) (1950) 337–404.
[64] S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman Scientific & Technical, Harlow, England, 1988.
[65] K.R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks 12 (2) (2001) 181–202.
[66] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, USA, 2001.
[67] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, Inc., New York, NY, USA, 1995.
[68] Z.D. Wu, W.X. Xie, J.P. Yu, Fuzzy c-means clustering algorithm based on kernel method, Comput. Intell. Multimedia Appl. (2003).
[69] D.-Q. Zhang, S.-C. Chen, Kernel based fuzzy and possibilistic c-means clustering, in: Proceedings of the International Conference Artificial Neural Network, Turkey, 2003, pp. 122–125.
[70] R. Inokuchi, S. Miyamoto, LVQ clustering and SOM using a kernel function, in: Proceedings of IEEE International Conference on Fuzzy Systems, vol. 3, 2004, pp. 1497–1500.
[71] D. Macdonald, C. Fyfe, The kernel self-organising map, in: Fourth International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies 2000, vol. 1, 2000, pp. 317–320.
[72] A.K. Qin, P.N. Suganthan, Kernel neural gas algorithms with application to cluster analysis, in: ICPR — 17th International Conference on Pattern Recognition (ICPR'04), vol. 4, 2004, pp. 617–620.
[73] D.-Q. Zhang, S.-C. Chen, Fuzzy clustering using kernel method, in: The 2002 International Conference on Control and Automation (ICCA), 2002, pp. 162–163.
[74] A.B. Hur, D. Horn, H.T. Siegelmann, V. Vapnik, Support vector clustering, J. Mach. Learn. Res. 2 (2001) 125–137.
[75] M. Girolami, Mercer kernel based clustering in feature space, IEEE Trans. Neural Networks 13 (3) (2002) 780–784.
[76] K. Rose, Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, Proc. IEEE 86 (11) (1998) 2210–2239.
[77] A.B. Hur, D. Horn, H.T. Siegelmann, V. Vapnik, A support vector method for clustering, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), NIPS, 2000, pp. 367–373.
[78] D.M.J. Tax, R.P.W. Duin, Support vector domain description, Pattern Recognition Lett. 20 (11–13) (1999) 1191–1199.
[79] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discovery 2 (2) (1998) 121–167.
[80] D. Lee, An improved cluster labeling method for support vector clustering, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 461–464.
[81] J. Yang, V. Estirill-Castro, S.K. Chalup, Support vector clustering through proximity graph modelling, in: Proceedings of the Ninth International Conference on Neural Information Processing, vol. 2, 2002, pp. 898–903.
[82] J.-H. Chiang, P.-Y. Hao, A new kernel-based fuzzy clustering approach: support vector clustering with cell growing, IEEE Trans. Fuzzy Systems 11 (4) (2003) 518–527.
[83] W.E. Donath, A.J. Hoffman, Lower bounds for the partitioning of graphs, IBM J. Res. Dev. 17 (1973) 420–425.
[84] D. Wagner, F. Wagner, Between min cut and graph bisection, in: Mathematical Foundations of Computer Science, 1993, pp. 744–750.
[85] M. Brand, K. Huang, A unifying theorem for spectral embedding and clustering, in: C.M. Bishop, B.J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[86] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[87] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[88] S.X. Yu, J. Shi, Multiclass spectral clustering, in: ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, IEEE Computer Society, Washington, DC, USA, 2003.
[89] Y. Bengio, O. Delalleau, N. Le Roux, J.F. Paiement, P. Vincent, M. Ouimet, Learning eigenfunctions links spectral embedding and kernel PCA, Neural Comput. 16 (10) (2004) 2197–2219.
[90] Y. Bengio, P. Vincent, J.F. Paiement, Spectral clustering and kernel PCA are learning eigenfunctions, Technical Report 2003s-19, CIRANO, 2003.
[91] G.H. Golub, C.F.V. Loan, Matrix Computations, Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press, Baltimore, MD, October 1996.
[92] P.K. Chan, M. Schlag, J.Y. Zien, Spectral k-way ratio-cut partitioning and clustering, in: Proceedings of the 1993 Symposium on Research on Integrated Systems, MIT Press, Cambridge, MA, USA, 1993, pp. 123–142.
[93] B.W. Kernighan, S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Tech. J. 49 (1) (1970) 291–307.
[94] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540–1551.

About the Author—MAURIZIO FILIPPONE received a Laurea degree in Physics in 2004 from the University of Genova, where he is now pursuing a Ph.D. in Computer Science. His research interests focus on Machine Learning techniques and their applications to Bioinformatics and to time series analysis and forecasting.

About the Author—FRANCESCO CAMASTRA received the Laurea degree in Physics in 1985 from the University of Milan and the Ph.D. degree in Computer Science from the University of Genova in 2004. He was the recipient of the Eduardo Caianiello Award in 2005, and since February 2006 he has been an Assistant Professor at the University of Napoli Parthenope. His research interests are Machine Learning and Kernel Methods.

About the Author—FRANCESCO MASULLI received the Laurea degree in Physics in 1976 from the University of Genova. He is currently an Associate Professor in Computer Science at the University of Genova. From 1983 to 2001 he was an Assistant Professor at the University of Genova, and from 2001 to 2005 an Associate Professor at the University of Pisa. He has authored or co-authored more than 120 scientific papers in Machine Learning, Neural Networks, Fuzzy Systems, Pattern Recognition and Bioinformatics.

About the Author—STEFANO ROVETTA received a Laurea degree in Electronic Engineering in 1992 and a Ph.D. degree in Models, Methods and Tools
for Electronic and Electromagnetic Systems in 1996, both from the University of Genova. He is currently an Assistant Professor in Computer Science at the
University of Genova. His current research interests include Machine Learning and Clustering techniques for high-dimensional biomedical data analysis and
document analysis.
