
This article has been accepted for publication at the 2014 SIAM International Conference on Data Mining

Embed and Conquer:


Scalable Embeddings for Kernel k-Means on MapReduce

Ahmed Elgohary∗ Ahmed K. Farahat Mohamed S. Kamel Fakhri Karray

∗University of Waterloo, Waterloo, Ontario, Canada N2L 3G1.
Emails: {aelgohary, afarahat, mkamel, karray}@uwaterloo.ca

Abstract

The kernel k-means is an effective method for data clustering which extends the commonly-used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastructures for distributed computing. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Afterwards, we propose two practical methods for low-dimensional embedding that adhere to our definition of the embeddings family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark datasets.

1 Introduction

In today's era of big data, there is an increasing demand from businesses and industries to get an edge over competitors by making the best use of their data. Clustering is one of the powerful tools that data scientists can employ to discover natural groupings from the data. The k-means algorithm [13] is the most commonly-used method for tackling this problem. It has gained popularity due to its effectiveness on many datasets as well as the ease of its implementation on different computing architectures.

The k-means algorithm, however, assumes that data are available in an attribute-value format, and that all attributes can be turned into numeric values so that each data instance is represented as a vector in some space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering problems. Examples include grouping users in social networks based on their friendship networks, clustering customers based on their behaviour, and grouping proteins based on their structure. Data scientists tend to simplify these complex structures into a vectorized format and accordingly lose the richness of the data they have.

In order to solve these problems, much research has been conducted on clustering algorithms that work on similarity matrices over data instances rather than on a vector representation of the data in a feature space. This led to the advance of different similarity-based methods for data clustering such as the kernel k-means [6] and spectral clustering [18]. The focus of this paper is on the kernel k-means. Different from the traditional k-means, the kernel k-means algorithm works on kernel matrices which encode different aspects of similarity between complex data structures. It has also been shown that the widely-accepted spectral clustering method has an objective function which is equivalent to that of a weighted variant of the kernel k-means [6], which means that optimizing that criterion allows for an efficient implementation of the spectral clustering algorithm in which the computationally complex eigendecomposition step is bypassed. Accordingly, the methods proposed in this paper can be leveraged for scaling the spectral clustering method on MapReduce.

The kernel k-means algorithm requires calculating and storing the complete kernel matrix. Further, all entries of the kernel matrix need to be accessed in each iteration. As a result, the kernel k-means suffers from scalability issues when applied to large-scale data. Some recent approaches [3, 4] have been proposed to approximate the kernel k-means clustering and allow its application to large data. However, these algorithms are designed for centralized settings, and assume that the data will fit in the memory/disk of a single machine.

This paper proposes efficient algorithms for scaling kernel k-means on MapReduce. The algorithms first learn embeddings of the data instances, and then use these embeddings to approximate the assignment step of the kernel k-means algorithm. The contributions of the paper can be summarized as follows. (1) The paper defines a family of embeddings, which we call Approximate Nearest Centroid (APNC) embeddings, that can be utilized by the proposed algorithms to approximate kernel k-means. (2) Exploiting the properties of APNC embeddings, the paper presents a unified and efficient parallelization strategy on MapReduce for approximating the kernel k-means.
(3) The paper proposes two practical APNC embeddings which are based on the Nyström method and on the use of p-stable distributions for the approximation of vector norms. (4) Medium and large-scale experiments were conducted to compare the proposed approach to state-of-the-art kernel k-means approximations, and to demonstrate the effectiveness and scalability of the presented algorithms.

The paper is organized as follows. The rest of this section describes the notations used throughout the paper. Section 2 gives the necessary background on MapReduce and the kernel k-means. We define the family of APNC embeddings in Section 3, and describe the proposed scalable kernel k-means algorithms in Section 4. Sections 5 and 6 derive the two proposed APNC embeddings. Section 7 discusses the related work. The experiments and results are shown in Section 8. Finally, we conclude the paper in Section 9.

1.1 Notations. Scalars are denoted by small letters (e.g., m, n), sets are denoted by script letters (e.g., L), vectors are denoted by small bold italic letters (e.g., φ, y), and matrices are denoted by capital letters (e.g., Φ, Y). In addition, the following notations are used:

For a set L:
  L^(b)        the subset of L corresponding to the data block b.
  |L|          the cardinality of the set.

For a vector x ∈ R^m:
  x_i          the i-th element of x.
  x^(i)        the vector x corresponding to the data instance i.
  x_[b]        the vector x corresponding to the data block b.
  ||x||_p      the ℓp-norm of x.

For a matrix A ∈ R^{m×n}:
  A_ij         the (i, j)-th entry of A.
  A_i:         the i-th row of A.
  A_:j         the j-th column of A.
  A_L:, A_:L   the sub-matrices of A which consist of the set L of rows and columns, respectively.
  A^(b)        the sub-matrix of A corresponding to the data block b.

2 Background

2.1 MapReduce Framework. MapReduce [5] is a programming model supported by an execution framework for big data analytics over a distributed environment of commodity machines. To ensure scalable and fault-tolerant execution of analytical jobs, MapReduce imposes a set of constraints on data access at each machine and on communication between different machines. MapReduce is currently considered the typical software infrastructure for many data analytics tasks over large distributed clusters.

In MapReduce, the dataset being processed is viewed as distributed chunks of key-value pairs. A single MapReduce job processes the dataset over two phases. In the first phase, namely the map phase, each key-value pair is converted by a user-defined map function to new intermediate key-value pairs. The intermediate key-value pairs of the entire dataset are grouped by key, and provided through the network to a reduce function in the second phase (the reduce phase). The reducer processes a single key and its associated values at a time, and outputs new key-value pairs, which are collectively considered the output of the job. For complex analytical tasks, multiple jobs are chained together, or multiple iterations of the same job are carried out over the input dataset [9]. It is important to note that since the individual machines in cloud computing infrastructures have very limited memory, a scalable MapReduce algorithm should ensure that the memory required per machine remains within the bound of commodity memory sizes as the data size increases.
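To make the key-value data flow described above concrete, the following is a minimal, single-process sketch of a MapReduce job (a word count) with the map, group-by-key, and reduce phases simulated in plain Python. It only illustrates the programming model; the function names and driver are illustrative and not tied to any particular MapReduce engine.

```python
# A minimal, single-process sketch of the MapReduce programming model:
# each input key-value pair is mapped to intermediate pairs, the pairs are
# grouped by key, and a reducer emits one output pair per key.
from collections import defaultdict

def map_fn(doc_id, text):
    # Convert one input pair <doc_id, text> into intermediate <word, 1> pairs.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Process a single key and all of its grouped values at once.
    yield word, sum(counts)

def run_job(input_pairs, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for key, value in input_pairs:                  # map phase
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    output = []
    for key, values in grouped.items():             # reduce phase
        output.extend(reduce_fn(key, values))
    return output

if __name__ == "__main__":
    docs = [(0, "kernel k-means on mapreduce"), (1, "scalable kernel embeddings")]
    print(run_job(docs, map_fn, reduce_fn))
```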
2.2 Kernel k-Means. The kernel k-means [6] is a variant of the k-means algorithm in which the distance between a data point and a centroid is calculated in terms of the kernel matrix K, which encodes the inner-products between data points in some high-dimensional space. Let φ^(i) be the representation of data instance i in the high-dimensional space implicitly defined by the kernel function. The distance between this data point and the centroid of cluster c can be defined as

(2.1)    ||φ^(i) − φ̄^(c)||² = K_ii − (2/n_c) Σ_{a∈Π_c} K_ia + (1/n_c²) Σ_{a,b∈Π_c} K_ab ,

where Π_c is the set of points in cluster c and n_c = |Π_c|. This means that in order to find the closest centroid for each data instance, a single pass over the whole kernel matrix is needed. In addition, the whole kernel matrix needs to be stored in memory. This makes the computational and space complexities of the algorithm quadratic. Accordingly, it is infeasible to implement the original kernel k-means algorithm on MapReduce due to the limited memory and computing power of each machine. As the data size increases, this becomes a scalability bottleneck which limits the application of the kernel k-means to large-scale datasets.
3 Embeddings for Scaling Kernel k-Means

In this section, we define a family of embeddings, which we call Approximate Nearest Centroid (APNC) embeddings, that can be used to scale kernel k-means on MapReduce. Essentially, we aim at embeddings that: (1) can be computed in a MapReduce-efficient manner, and (2) can efficiently approximate the cluster assignment step of the kernel k-means on MapReduce (Eq. 2.1). We start by defining a set of properties which an embedding should have for the aforementioned conditions to be satisfied.

Let i be a data instance, and φ = Φ_:i be a vector corresponding to i in the kernel space implicitly defined by the kernel function. Let f : R^d → R^m be an embedding function that maps φ to a target vector y, i.e., y = f(φ). In order to use f(φ) with the proposed MapReduce algorithms, the following properties have to be satisfied.

Property 3.1. f(φ) is a linear map, i.e., y = f(φ) = Tφ, where T ∈ R^{m×d}.

If this property is satisfied, then for any cluster c, the embedding of its centroid is the same as the centroid of the embeddings of the data instances that belong to that cluster:

    ȳ^(c) = f(φ̄^(c)) = (1/n_c) Σ_{j∈Π_c} f(φ^(j)) = (1/n_c) Σ_{j∈Π_c} y^(j) ,

where ȳ^(c) is the embedding of the centroid φ̄^(c).

Property 3.2. f(φ) is kernelized.

In order for this property to be satisfied, we restrict the columns of the transformation matrix T to be in the subspace spanned by a subset of data instances L ⊆ D, |L| = l and l ≤ n:

    T = R Φ_:L^T .

Substituting in f(φ) gives

(3.2)    y = f(φ) = Tφ = R Φ_:L^T φ = R K_{Li} ,

where K_{Li} is the kernel matrix between the set of instances L and the i-th data instance, and R ∈ R^{m×l}. We refer to R as the embedding coefficients matrix. Suppose the set L defined in Property 3.2 consists of q disjoint subsets L^(1), L^(2), ..., and L^(q).

Property 3.3. The embedding coefficients matrix R is in a block-diagonal form:

    R = [ R^(1)   0     0
          0      ...    0
          0       0    R^(q) ] ,

where q is the number of blocks and the b-th sub-matrix R^(b), along with its corresponding subset of data instances L^(b), can be computed and fit in the memory of a single machine.

It should be noted that different embeddings of the defined family differ in their definitions of the coefficients matrix R.

Property 3.4. There exists a function e(·,·) that approximates the ℓ2-distance between each data point i and the centroid of cluster c in terms of their embeddings y^(i) and ȳ^(c) only, i.e.,

    ∃ e(·,·) : ||φ^(i) − φ̄^(c)||_2 ≈ β e(y^(i), ȳ^(c))   ∀ i, c ,

where β is a constant.

This property allows for approximating the cluster assignment step of the kernel k-means as

(3.3)    π̃(i) = arg min_c e(y^(i), ȳ^(c)) .

4 MapReduce-Efficient Kernel k-Means

In this section, we show how the four properties of APNC embeddings can be exploited to develop efficient and unified parallel MapReduce algorithms for kernel k-means. We start with the algorithm for computing the corresponding embedding of each data instance, then explain how to use these embeddings to approximate the kernel k-means. From Property 3.2 and Property 3.3, the embedding y^(i) of a data instance i is given by

(4.4)    y^(i) = [ R^(1)   0     0
                   0      ...    0
                   0       0    R^(q) ] K_{Li} ,

for a set of selected data points L. The set L consists of q disjoint subsets L^(1), L^(2), ..., and L^(q). So, the vector K_{Li} can then be written in the form of q blocks
as K_{Li} = [K_{L^(1)i}^T  K_{L^(2)i}^T  ...  K_{L^(q)i}^T]^T. Accordingly, the embedding formula in Eq. (4.4) can be written as

(4.5)    y^(i) = [ R^(1) K_{L^(1)i}
                   R^(2) K_{L^(2)i}
                   ...
                   R^(q) K_{L^(q)i} ] .

As per Property 3.3, each block R^(b) and the sample instances L^(b) used to compute its corresponding K_{L^(b)i} are assumed to fit in the memory of a single machine. This suggests computing y^(i) in a piecewise fashion, where each portion y^(i)_[b] is computed separately using its corresponding R^(b) and L^(b).

Our embedding algorithm on MapReduce computes the embedding portions of all data instances in rounds of q iterations. In each iteration, each mapper loads the corresponding coefficient block R^(b) and data samples L^(b) into its memory. Afterwards, for each data point, the vector K_{L^(b)i} is computed using the provided kernel function, and then used to compute the embedding portion as y^(i)_[b] = R^(b) K_{L^(b)i}. Finally, in a single map phase, the portions of each data instance i are concatenated together to form the embedding y^(i). It is important to note that the embedding portions of each data point are stored on the same machine, which means that the concatenation phase has no network cost. The only network cost required by the whole embedding algorithm is that of loading the sub-matrices R^(b) and L^(b) once for each b. Algorithm 1 outlines the embedding steps on MapReduce. We denote each key-value pair of the input dataset D as <i, D{i}>, where i refers to the index of the data instance D{i}.

Algorithm 1 APNC Embedding on MapReduce
Input: D, κ(·,·), R, L, q
 1: for b = 1:q
 2:   map:
 3:     Load L^(b) and R^(b)
 4:     foreach <i, D{i}>
 5:       K_{L^(b)i} ← κ(L^(b), D{i})
 6:       y^(i)_[b] ← R^(b) K_{L^(b)i}
 7:       emit(i, y^(i)_[b])
 8: map:
 9:   foreach <i, y^(i)_[1], y^(i)_[2], ..., y^(i)_[q]>
10:     Y_:i ← join(y^(i)_[1], y^(i)_[2], ..., y^(i)_[q])
11:     emit(i, Y_:i)
instances L(b) used to compute its corresponding KL(b) i
are assumed to fit in the memory of a single machine. 5 APNC Embedding via Nyström Method
This suggests computing y (i) in a piecewise fashion, One way to preserve the objective function of the cluster
(i)
where each portion y[b] is computed separately using assignment step given by Eq. (2.1) is to find a low-
its corresponding R(b) and L(b) . rank kernel matrix K̃ over the data instances such
Our embedding algorithm on MapReduce computes that K ≈ K̃. Using this kernel matrix in Eq. (2.1)
the embedding portions of all data instances in rounds results in a cluster assignment which is very close to the
of q iterations. In each iteration, each mapper loads the assignment obtained using the original kernel k-means
corresponding coefficient block R(b) and data samples algorithm. If the low-rank approximation K̃ can be
L(b) in its memory. Afterwards, for each data point, decomposed into W T W where W ∈ Rm×n and m  n,
the vector KL(b) i is computed using the provided kernel then the columns of W can be directly used as an
function, and then used to compute the embedding embedding that approximates the `2 -distance between
(i)
portion as y [b] = R(b) KL(b) i . Finally, in a single data instance i and the centroid of cluster c as
map phase, the portions of each data instance i are (c)
concatenated together to form the embedding y (i) . It is (5.6) φ(i) − φ̄ ≈ w(i) − w̄(c) .
2 2
important to note that the embedding portions of each
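Below is a small local simulation of one iteration of Algorithm 2, assuming e(·,·) is the ℓ2 distance (one of the two discrepancy functions used later). Each simulated mapper only produces its partial sums Z and counts g, and the reducer averages them into the updated centroid embeddings. The partitioning, sizes, and initialization are illustrative only.

```python
# One clustering iteration of Algorithm 2, simulated locally: mappers assign
# embeddings to the nearest centroid and accumulate per-cluster sums Z and
# counts g; a reducer averages them into the updated centroid matrix Y_bar.
import numpy as np

def map_partition(Y_part, Y_bar):
    m, k = Y_bar.shape
    Z = np.zeros((m, k))                       # per-cluster sums of embeddings
    g = np.zeros(k)                            # per-cluster counts
    for y in Y_part.T:                         # each column is one embedding y^(i)
        c_hat = np.argmin(np.linalg.norm(y[:, None] - Y_bar, axis=0))
        Z[:, c_hat] += y
        g[c_hat] += 1
    return Z, g                                # only Z and g cross the network

def reduce_centroids(partials, Y_bar_old):
    Z = sum(p[0] for p in partials)
    g = sum(p[1] for p in partials)
    Y_bar = Y_bar_old.copy()
    nonempty = g > 0
    Y_bar[:, nonempty] = Z[:, nonempty] / g[nonempty]   # averaged embeddings
    return Y_bar

rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 1000))                      # embeddings of 1000 instances
Y_bar = Y[:, rng.choice(1000, size=5, replace=False)]    # k = 5 initial centroids
for _ in range(10):                                      # repeat until convergence
    partials = [map_partition(part, Y_bar) for part in np.array_split(Y, 4, axis=1)]
    Y_bar = reduce_centroids(partials, Y_bar)
```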
5 APNC Embedding via Nyström Method

One way to preserve the objective function of the cluster assignment step given by Eq. (2.1) is to find a low-rank kernel matrix K̃ over the data instances such that K ≈ K̃. Using this kernel matrix in Eq. (2.1) results in a cluster assignment which is very close to the assignment obtained using the original kernel k-means algorithm. If the low-rank approximation K̃ can be decomposed into W^T W, where W ∈ R^{m×n} and m ≪ n, then the columns of W can be directly used as an embedding that approximates the ℓ2-distance between data instance i and the centroid of cluster c as

(5.6)    ||φ^(i) − φ̄^(c)||_2 ≈ ||w^(i) − w̄^(c)||_2 .

To prove that, the square of the right-hand side can be expanded as

    w^(i)T w^(i) − 2 w^(i)T w̄^(c) + w̄^(c)T w̄^(c) = K̃_ii − (2/n_c) Σ_{a∈Π_c} K̃_ia + (1/n_c²) Σ_{a,b∈Π_c} K̃_ab ,

which is an approximation of the distance function of Eq. (2.1). There are many low-rank decompositions that can be calculated for the kernel matrix
K, including the very accurate eigenvalue decomposition. However, the low-rank approximation to be used has to satisfy the properties defined in Section 3, so that it can be implemented on MapReduce in an efficient manner.

One well-known method for calculating low-rank approximations of kernel matrices is the Nyström approximation [19]. The Nyström method approximates a kernel matrix over all data instances using the sub-matrix of the kernel between all data instances and a small set of data instances L as

(5.7)    K̃ = D^T A^{-1} D ,

where |L| = l ≪ n, A ∈ R^{l×l} is the kernel matrix over the data instances in L, and D ∈ R^{l×n} is the kernel matrix between the data instances in L and all data instances. In order to obtain a low-rank decomposition of K̃, the Nyström method calculates the eigendecomposition of the small matrix A as A ≈ UΛU^T, where U ∈ R^{l×m} is the matrix whose columns are the leading m eigenvectors of A, and Λ ∈ R^{m×m} is the diagonal matrix whose diagonal elements are the leading m eigenvalues of A. This means that a low-rank decomposition can be obtained as K̃ = W^T W, where W = Λ^{-1/2} U^T D. It should be noted that this embedding satisfies Properties 3.1 and 3.2, as D = Φ_:L^T Φ and accordingly y^(i) = W_:i = Λ^{-1/2} U^T Φ_:L^T φ^(i). Further, Equation (5.6) tells us that e(y^(i), ȳ^(c)) = ||y^(i) − ȳ^(c)||_2 can be used to approximate the ℓ2-distance in Eq. (2.1), which satisfies Property 3.4 of the APNC family.

The embedding coefficients matrix R = Λ^{-1/2} U^T is a special case of that described in Property 3.3, which consists of one block of size m × l, where l is the number of instances used to calculate the Nyström approximation, and m is the rank of the eigendecomposition used to compute both Λ and U. It can be assumed that R is computed and fits in the memory of a single machine, since an accurate Nyström approximation can usually be obtained using very few samples and m ≤ l. Algorithm 3 outlines the MapReduce algorithm for computing the coefficients matrix R. The algorithm uses the map phase to iterate over the input dataset in parallel and uniformly sample l data instances. The sampled instances are then moved to a single reducer that computes R as described above.

Algorithm 3 APNC Coeff. via Nyström Method
Input: D, n, κ(·,·), l, m
1: map:
2:   for <i, D{i}>
3:     with probability l/n, emit(0, D{i})
4: reduce:
5:   for L ← all values D{i}
6:     K_{LL} ← κ(L, L)
7:     [V, Λ] ← eigen(K_{LL}, m)
8:     R ← Λ^{-1/2} V^T
9:     emit(<L, R>)
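A minimal NumPy sketch of the reducer step in Algorithm 3 follows: it eigendecomposes the sampled kernel matrix K_LL, keeps the leading m eigenpairs, forms R = Λ^{-1/2}V^T, and embeds an instance as y = R κ(L, x). The RBF kernel, sample, and sizes are toy choices for illustration only.

```python
# Reducer of Algorithm 3: eigendecompose the small kernel matrix of the l
# sampled instances and form the Nystrom coefficients R = Lambda^{-1/2} V^T.
import numpy as np

def rbf(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
L = rng.standard_normal((100, 5))              # l = 100 uniformly sampled instances (toy data)
m = 30

K_LL = rbf(L, L)                               # step 6: kernel over the sample
evals, evecs = np.linalg.eigh(K_LL)            # eigenvalues in ascending order
evals, evecs = evals[::-1][:m], evecs[:, ::-1][:, :m]   # leading m eigenpairs
R = np.diag(1.0 / np.sqrt(evals)) @ evecs.T    # step 8: R = Lambda^{-1/2} V^T  (m x l)

# Embedding a new instance (Eq. 3.2 with the Nystrom coefficients):
x = rng.standard_normal(5)
y = R @ rbf(L, x[None, :]).ravel()             # m-dimensional APNC-Nys embedding
```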
The Nyström embedding can be extended by the use of the ensemble Nyström method [15]. In that case, each block of R will be the coefficients of the Nyström embedding corresponding to the subset of data instances that belong to that instance of the ensemble. The details of that extension are the subject of future work.

6 APNC Embedding via Stable Distributions

In this section, we develop our second embedding method based on the results of Indyk [12], which showed that the ℓp-norm of a d-dimensional vector v can be estimated by means of p-stable distributions. Given a d-dimensional vector r whose entries are i.i.d. samples drawn from a p-stable distribution over R, the ℓp-norm of v is given by

(6.8)    ||v||_p = α E[ | Σ_{i=1}^d v_i r_i | ] ,

for some positive constant α. It is known that the standard Gaussian distribution N(0,1) is 2-stable [12], which means that it can be employed to compute the ℓ2-norm of Eq. (2.1) as

(6.9)    ||φ − φ̄||_2 = α E[ | Σ_{i=1}^d (φ_i − φ̄_i) r_i | ] ,

where d is the dimensionality of the space endowed by the used kernel function and the entries r_i ∼ N(0,1). The expectation above can be approximated by the sample mean of multiple values of the term |Σ_{i=1}^d (φ_i − φ̄_i) r_i| computed using m different vectors r, each of
which is denoted as r^(j). Thus, the ℓ2-norm in Eq. (6.9) can be approximated as

(6.10)    ||φ − φ̄||_2 ≈ (α/m) Σ_{j=1}^m | Σ_{i=1}^d φ_i r_i^(j) − φ̄_i r_i^(j) | .

Define two m-dimensional embeddings y and ȳ such that y_j = Σ_{i=1}^d φ_i r_i^(j) and ȳ_j = Σ_{i=1}^d φ̄_i r_i^(j), or equivalently, y_j = φ^T r^(j) and ȳ_j = φ̄^T r^(j). Equation (6.10) can be expressed in terms of y and ȳ as

(6.11)    ||φ − φ̄||_2 ≈ (α/m) Σ_{j=1}^m |y_j − ȳ_j| = (α/m) ||y − ȳ||_1 .
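Before kernelizing, it may help to see Eqs. (6.9)-(6.11) numerically in the explicit case. The sketch below projects φ − φ̄ onto m i.i.d. standard Gaussian vectors and averages absolute values; for N(0,1) entries, E|v·r| = ||v||_2 √(2/π), so taking α = √(π/2) recovers the ℓ2 distance up to sampling error. The dimensions and data are arbitrary toy values.

```python
# Numerical check of the p-stable norm estimate (Gaussian, p = 2): the estimate
# equals (alpha/m) * ||y - y_bar||_1 for the projected vectors, as in Eq. (6.11).
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 200
phi = rng.standard_normal(d)
phi_bar = rng.standard_normal(d)

Rproj = rng.standard_normal((m, d))            # rows are the Gaussian vectors r^(j)
y = Rproj @ phi                                # y_j = phi^T r^(j)
y_bar = Rproj @ phi_bar                        # y_bar_j = phi_bar^T r^(j)

alpha = np.sqrt(np.pi / 2.0)                   # E|Z| = sigma * sqrt(2/pi) for Z ~ N(0, sigma^2)
estimate = (alpha / m) * np.abs(y - y_bar).sum()
exact = np.linalg.norm(phi - phi_bar)
print(exact, estimate)                         # the two values should be close
```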
Since φ, φ̄ and r^(j) are all intractable to work with explicitly, our next step is to kernelize the computations of y and ȳ. Without loss of generality, let T_j = {φ̂^(1), φ̂^(2), ..., φ̂^(t)} be a set of t randomly chosen data instances embedded and centered in the kernel space (i.e., φ̂^(i) = φ^(i) − (1/t) Σ_{j=1}^t φ^(j)). According to the central limit theorem, the vector r^(j) = (1/√t) Σ_{φ̂∈T_j} φ̂ approximately follows a multivariate Gaussian distribution N(0, Σ), where Σ is the covariance matrix of the underlying distribution of all data instances embedded in the kernel space [14]. But according to our definition of y and ȳ, the individual entries of r^(j) have to be independent and identically distributed Gaussians. To fulfil that requirement, we make use of the fact that decorrelating the variables of a joint Gaussian distribution is enough to ensure that the individual variables are independent and marginally Gaussian. Using the whitening transform, r^(j) is redefined as

(6.12)    r^(j) = (1/√t) Σ_{φ̂∈T_j} Σ̃^{-1/2} φ̂ ,

where Σ̃ is an approximate covariance matrix estimated using a sample of l data points embedded in the kernel space and centred as well. We denote the set of these l data points as L.

With r^(j) defined as in Eq. (6.12), the computation of y and ȳ can be fully kernelized using steps similar to those in [14]. Accordingly, y and ȳ can be computed as follows: let K_{LL} be the kernel matrix of L, and define a centering matrix H = I − (1/l) e e^T, where I is an l × l identity matrix and e is a vector of all ones. Denote the inverse square root of the centered version of K_{LL} as E.¹ The embedding of a vector φ is then given by

(6.13)    y = f(φ) = R Φ_:L^T φ ,

such that for j = 1 to m, R_j: = s^T E, where s is an l-dimensional binary vector indexing t randomly chosen values from 1 to l for each j.

It is clear from Eq. (6.13) that f is a linear map in a kernelized form, which satisfies Properties 3.1 and 3.2. Equation (6.11) shows that the ℓ2-norm of the difference between a data point φ and a cluster centroid φ̄ can be approximated up to a constant by e(y, ȳ) = ||y − ȳ||_1, which satisfies Property 3.4 of the APNC family. The coefficients matrix R in Eq. (6.13) consists of a single block, which can be assumed to be computable in the memory of a single commodity machine. That assumption is justified by observing that R is computed using a sample of a few data instances that are used to conceptually estimate the covariance matrix of the data distribution. Furthermore, the target dimensionality, denoted as m in Eq. (6.13), determines the sample size used to estimate the expectation in Eq. (6.9), which also can be estimated using a small number of samples. We validate the assumptions about l and m in our experiments. This accordingly satisfies Property 3.3. We outline the MapReduce algorithm for computing the coefficients matrix R, defined by Eq. (6.13), in Algorithm 4. Similar to Algorithm 3, we sample l data instances in the map phase, and then R is computed using the sampled data instances in a single reducer.

¹ The centered version of K_{LL} is given by H K_{LL} H. Its inverse square root can be computed as Λ^{-1/2} V^T, where Λ is a diagonal matrix of the eigenvalues of H K_{LL} H and V is the corresponding eigenvector matrix.

Algorithm 4 APNC Coeff. via Stable Distributions
Input: D, κ(·,·), l, m, t
 1: map:
 2:   for <i, D{i}>
 3:     with probability l/n, emit(0, D{i})
 4: reduce:
 5:   for L ← all values D{i}
 6:     K_{LL} ← κ(L, L)
 7:     H ← I − (1/l) e e^T
 8:     [V, Λ] ← eigen(H K_{LL} H)
 9:     E ← Λ^{-1/2} V^T
10:     for r = 1:m
11:       T ← select t unique values from 1 to l
12:       R_r: ← Σ_{v∈T} E_v:
13:     emit(<L, R>)
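The sketch below follows the reducer of Algorithm 4: center the sample kernel matrix with H, take E = Λ^{-1/2}V^T from the eigendecomposition of H K_LL H, and build each row of R as the sum of t randomly chosen rows of E. The small-eigenvalue cutoff is a numerical safeguard added here (it discards the near-null direction created by centering) and is not part of Algorithm 4; the RBF kernel, data, and sizes are illustrative, and constant factors such as 1/√t are dropped since Eq. (3.3) only compares distances.

```python
# Reducer of Algorithm 4 (APNC-SD coefficients), with a numerical safeguard.
import numpy as np

def rbf(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
L = rng.standard_normal((100, 5))                 # l = 100 sampled instances (toy data)
l, m, t = L.shape[0], 50, 40

K_LL = rbf(L, L)                                  # step 6
H = np.eye(l) - np.ones((l, l)) / l               # step 7: H = I - (1/l) e e^T
evals, evecs = np.linalg.eigh(H @ K_LL @ H)       # step 8: eigen(H K_LL H)
keep = evals > 1e-8 * evals.max()                 # safeguard: drop the near-zero direction
E = np.diag(evals[keep] ** -0.5) @ evecs[:, keep].T   # step 9: E = Lambda^{-1/2} V^T

R = np.zeros((m, l))
for j in range(m):                                # steps 10-12: each row of R sums t rows of E
    T = rng.choice(E.shape[0], size=t, replace=False)
    R[j, :] = E[T, :].sum(axis=0)

# Embedding a new instance (Eq. 6.13); cluster distances then use the l1 norm,
# e(y, y_bar) = ||y - y_bar||_1, up to a constant that does not affect Eq. (3.3).
x = rng.standard_normal(5)
y = R @ rbf(L, x[None, :]).ravel()                # m-dimensional APNC-SD embedding
```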
7 Related Work

The quadratic runtime complexity per iteration, in addition to the quadratic space complexity, of the kernel k-means has limited its applicability to even medium-scale datasets on a single machine. Recent work [3, 4] to tackle these scalability limitations has focused only on centralized settings, with the assumption that the dataset being clustered fits in the memory/disk of a single machine. Specifically, Chitta et al. [3] suggested restricting the clustering centroids to an at most rank-l subspace of the span of the entire dataset, where l ≪ n. That approximation reduces the runtime complexity per iteration to O(l²k + nlk), and the space complexity to O(nl), where k is the number of clusters. However, that approximation is not sufficient for scaling kernel k-means on MapReduce, since assigning each data point to the nearest cluster still requires accessing the current cluster assignments of all data points. It was also noticed by the authors that their method is equivalent to applying the original kernel k-means algorithm to the rank-l Nyström approximation of the entire kernel matrix [3]. That is algorithmically different from our Nyström-based embedding in the sense that we use the concept of the Nyström approximation to learn low-dimensional embeddings for all data instances, which allows for clustering the data instances by applying
a simple and MapReduce-efficient algorithm to their corresponding embeddings.

Later, Chitta et al. [4] exploited the Random Fourier Features (RFF) approach [17] to propose fast algorithms for approximating the kernel k-means. However, these algorithms inherit the limitations of the RFF approach, such as being limited to shift-invariant kernels and requiring the data instances to be in a vectorized form. Furthermore, the theoretical and empirical results of Yang et al. [20] showed that the kernel approximation accuracy of RFF-based methods depends on the properties of the eigenspectrum of the original kernel matrix, and that ensuring acceptable approximation accuracy requires using a large number of Fourier features, which increases the dimensionality of the computed RFF-based embeddings. In our experiments, we empirically show that our kernel k-means methods achieve clustering accuracy superior to that achieved using the state-of-the-art approximations presented in [3] and [4].

Other than the kernel k-means, the spectral clustering algorithm [18] is considered a powerful approach to kernel-based clustering. Chen et al. [1] presented a distributed implementation of the spectral clustering algorithm using an infrastructure composed of MapReduce, MPI, and SSTable.² In addition to the limited scalability of MPI, the reported running times are very large. We believe this was mainly due to the very large network overhead resulting from building the kernel matrix using SSTable. Later, Gao et al. [11] proposed an approximate distributed spectral clustering approach that relies solely on MapReduce. The authors showed that their approach significantly reduced the clustering time compared to that of Chen et al. [1]. However, in the approach of Gao et al. [11], the kernel matrix is approximated as block-diagonal, which enforces inaccurate pre-clustering decisions that could result in degraded clustering accuracy.

Scaling other algorithms for data clustering on MapReduce was also studied in recent work [8, 10, 16]. However, those works are limited to co-clustering algorithms [16], subspace clustering [10], and metric k-centers and metric k-median with the assumption that all pairwise similarities are pre-computed and provided explicitly [8].

² http://wiki.apache.org/cassandra/ArchitectureSSTable

8 Experiments and Results

We evaluated the two proposed algorithms by conducting experiments on four medium and three big datasets. The properties of the datasets are summarized in Table 1. All the datasets have been used in previous work to evaluate large-scale clustering algorithms in general [1, 2] and the kernel k-means algorithm in particular [3, 4]. In all experiments, after the clustering is performed, the cluster labels are compared to the ground-truth labels, and the Normalized Mutual Information (NMI) between the clustering labels and the class labels is reported. We also report the embedding time and clustering time of the proposed algorithms in the large-scale experiments.

Table 1: The properties of the used datasets.
Dataset   | Type          | #Instances | #Features | #Clusters
USPS      | Digit Images  | 9,298      | 256       | 10
PIE       | Face Images   | 11,554     | 4,096     | 68
MNIST     | Digit Images  | 70,000     | 784       | 10
RCV1      | Documents     | 193,844    | 47,236    | 103
CovType   | Multivariate  | 581,012    | 54        | 7
ImageNet  | Images        | 1,262,102  | 900       | 164
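As a side note on the evaluation metric, the sketch below shows how a predicted clustering can be scored against ground-truth labels with NMI. scikit-learn is used here purely for convenience; the paper does not prescribe a particular NMI implementation.

```python
# Scoring a clustering against ground-truth class labels with NMI. NMI is
# invariant to a permutation of the cluster IDs, so a relabeled perfect
# clustering still scores 1.0.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

ground_truth = np.array([0, 0, 1, 1, 2, 2, 2, 1])
predicted    = np.array([1, 1, 0, 0, 2, 2, 2, 0])   # cluster IDs from the clustering step
print(normalized_mutual_info_score(ground_truth, predicted))   # prints 1.0
```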
The medium-scale experiments were carried out using MATLAB on a single machine to demonstrate the effectiveness of the proposed algorithms compared to previously proposed methods for approximating the kernel k-means. The large-scale experiments, carried out using the last three datasets in Table 1, were conducted on an Amazon EC2³ cluster which consists of 20 machines.⁴

In the medium-scale experiments, we compared our algorithms - APNC via Nyström (APNC-Nys) and APNC via Stable Distributions (APNC-SD) - to the approximate kernel k-means approach (Approx-KKM) of [3] and the two RFF-based algorithms (RFF and SV-RFF) presented in [4]. For APNC-Nys, APNC-SD, and Approx-KKM, we used three different values for the number of samples l, while fixing the parameter t in APNC-SD to 40% of l, and m to 1000. For a fair comparison, we set the number of Fourier features used in RFF and SV-RFF to 500, to obtain 1000-dimensional embeddings as in APNC-SD. An RBF kernel was used for both the PIE and ImageNet-50k⁵ datasets, a neural kernel for the USPS dataset, and a polynomial kernel for the MNIST dataset. We used the same kernel parameters reported in [3]. Table 2 summarizes the average and standard deviation of the NMIs achieved in 20 different runs of each algorithm. Being limited to shift-invariant kernels, both RFF and SV-RFF were only used for PIE and ImageNet-50k. Due to the large memory requirement of storing the kernel matrix, we report the accuracy of the exact kernel k-means (Exact-KKM) only for the datasets PIE and USPS.

³ http://aws.amazon.com/ec2/
⁴ Each machine comes with 7.5 GB of memory and a two-core processor.
⁵ A randomly sampled subset of 50k images out of the ImageNet dataset.

Table 2: The NMIs of different kernel k-means approximations (medium-scale datasets). In each sub-table, the best performing methods for each l according to a t-test (with 95% confidence level) are highlighted in bold.

NMI (%) - PIE - 11K, RBF (Exact-KKM: 20.79 ± 0.45)
Method      | l = 50        | l = 100       | l = 300
RFF         | 5.20 ± 0.12   | 5.20 ± 0.12   | 5.20 ± 0.12
SV-RFF      | 5.15 ± 0.11   | 5.15 ± 0.11   | 5.15 ± 0.11
Approx-KKM  | 13.99 ± 0.60  | 14.66 ± 1.01  | 15.95 ± 0.83
APNC-Nys    | 18.52 ± 0.26  | 19.23 ± 0.36  | 20.20 ± 0.46
APNC-SD     | 18.62 ± 0.37  | 19.50 ± 0.38  | 20.12 ± 0.35

NMI (%) - ImageNet - 50K, RBF
Method      | l = 50        | l = 100       | l = 300
RFF         | 6.12 ± 0.04   | 6.12 ± 0.04   | 6.12 ± 0.04
SV-RFF      | 5.96 ± 0.06   | 5.96 ± 0.06   | 5.96 ± 0.06
Approx-KKM  | 14.67 ± 0.25  | 15.12 ± 0.17  | 15.27 ± 0.15
APNC-Nys    | 15.62 ± 0.17  | 15.81 ± 0.12  | 15.79 ± 0.09
APNC-SD     | 15.66 ± 0.14  | 15.78 ± 0.14  | 15.76 ± 0.08

NMI (%) - USPS - 9K, Neural (Exact-KKM: 59.44 ± 0.66)
Method      | l = 50        | l = 100        | l = 300
Approx-KKM  | 37.60 ± 17.50 | 50.68 ± 11.28  | 57.17 ± 5.44
APNC-Nys    | 51.58 ± 11.74 | 55.77 ± 3.30   | 58.26 ± 0.95
APNC-SD     | 52.88 ± 7.25  | 55.34 ± 4.15   | 58.22 ± 0.87

NMI (%) - MNIST - 70K, Polynomial
Method      | l = 50        | l = 100       | l = 300
Approx-KKM  | 19.07 ± 1.45  | 20.73 ± 1.30  | 22.38 ± 1.06
APNC-Nys    | 19.68 ± 0.71  | 20.82 ± 1.44  | 21.93 ± 0.69
APNC-SD     | 23.00 ± 1.57  | 23.08 ± 1.58  | 23.86 ± 1.82

In the large-scale experiments, we compared APNC-Nys and APNC-SD to a baseline two-stage method (2-Stage) that uses the exact kernel k-means clustering results of a sample of l data instances to propagate the labels to all the other data instances [3]. We evaluated the three algorithms using three different values for l, while fixing m in APNC-Nys and APNC-SD to 500. We used a self-tuned RBF kernel for all datasets. Table 3 summarizes the average and standard deviation of the NMIs achieved in three different runs of each algorithm. The table also reports the embedding and clustering running times of the different APNC methods.

Table 3: The NMIs and run times of different kernel k-means approximations (big datasets). In each NMI sub-table, the best performing method(s) for each l according to a t-test is highlighted in bold.

RCV1 - 200K
Method    | NMI (%) l=500 | l=1000       | l=1500       | Embed. Time (mins) l=500 | l=1000     | l=1500     | Clust. Time (mins)
2-Stage   | 13.33 ± 0.53  | 13.56 ± 0.53 | 13.56 ± 0.06 | N/A                      | N/A        | N/A        | N/A
APNC-Nys  | 22.15 ± 0.09  | 23.77 ± 0.60 | 23.84 ± 0.80 | 3.1 ± 0.1                | 5.9 ± 0.2  | 10.9 ± 0.5 | 19.6 ± 0.1
APNC-SD   | 22.21 ± 0.39  | 24.34 ± 0.26 | 23.55 ± 0.17 | 3.0 ± 0.2                | 5.9 ± 0.2  | 9.9 ± 0.4  | 15.6 ± 0.4

CovType - 580K
Method    | NMI (%) l=500 | l=1000       | l=1500       | Embed. Time (mins) l=500 | l=1000     | l=1500     | Clust. Time (mins)
2-Stage   | 8.95 ± 2.98   | 10.23 ± 1.07 | 9.85 ± 1.88  | N/A                      | N/A        | N/A        | N/A
APNC-Nys  | 9.53 ± 2.55   | 12.31 ± 0.74 | 12.51 ± 1.08 | 3.7 ± 0.2                | 7.4 ± 0.2  | 11.3 ± 0.5 | 16.4 ± 0.1
APNC-SD   | 15.96 ± 1.03  | 15.08 ± 1.40 | 15.56 ± 0.18 | 3.8 ± 0.2                | 7.2 ± 0.1  | 11.6 ± 0.3 | 15.8 ± 0.1

ImageNet - 1.26M
Method    | NMI (%) l=500 | l=1000       | l=1500       | Embed. Time (mins) l=500 | l=1000     | l=1500     | Clust. Time (mins)
2-Stage   | 7.51 ± 0.42   | 7.58 ± 0.21  | 7.71 ± 0.20  | N/A                      | N/A        | N/A        | N/A
APNC-Nys  | 11.33 ± 0.05  | 11.26 ± 0.11 | 11.19 ± 0.03 | 15.6 ± 0.4               | 30.3 ± 2.4 | 45.2 ± 1.9 | 63.8 ± 2.8
APNC-SD   | 11.27 ± 0.06  | 11.26 ± 0.04 | 11.10 ± 0.05 | 14.9 ± 1.4               | 30.9 ± 0.9 | 44.6 ± 1.5 | 23.7 ± 0.3

It can be observed from Table 2 that the centralized versions of the proposed algorithms were significantly superior to all the other kernel k-means approximations in terms of clustering accuracy. Both methods performed similarly on all datasets except MNIST, on which APNC-SD outperformed APNC-Nys. The poor performance of RFF and SV-RFF is consistent with the results of [20], as described in Section 7. The table also shows that when using only 300 samples (i.e., l = 300), APNC-Nys and APNC-SD achieve clustering accuracy very close to that of the exact kernel k-means, which confirms the accuracy and reliability of the proposed approximations.

Table 3 demonstrates the effectiveness of the proposed algorithms in distributed settings compared to the baseline algorithm. It is also worth noting that, to the best of our knowledge, the best reported NMIs in the literature for the datasets RCV1, CovType, and ImageNet are 28.65% using spectral clustering [1], 14% using RFF [4], and 10.4% using Approx-KKM [3], respectively. Our algorithms managed to achieve better NMIs on both CovType and ImageNet, and a comparable clustering accuracy on RCV1. Table 3 also shows that APNC-Nys and APNC-SD have comparable embedding times. On the other hand, the clustering step of APNC-SD is faster than that of APNC-Nys, especially on the datasets with a large number of clusters (RCV1 and ImageNet). That advantage of the APNC-SD algorithm comes from using the ℓ1-distance as its discrepancy function, while APNC-Nys uses the ℓ2-distance as its discrepancy function.⁶

⁶ Further comments and insights on the reported results can be found in our technical report [7].

9 Conclusions

In this paper, we proposed parallel algorithms for scaling kernel k-means on MapReduce. We started by defining a family of low-dimensional embeddings characterized by a set of computational and statistical properties. Based on these properties, we presented a unified parallelization strategy that first computes the corresponding embeddings of all data instances of the given dataset, and then clusters them in a MapReduce-efficient manner. Based on the Nyström approximation and the properties of p-stable distributions, we derived two practical instances of the defined embeddings family. Combining each of the two embedding methods with the proposed parallelization strategy, we demonstrated the effectiveness of the presented algorithms by empirical evaluation on medium and large benchmark datasets. Our future work includes a detailed complexity analysis, and a study of the scalability of the proposed algorithms with respect to different tuning parameters.

References

[1] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and E. Y. Chang, Parallel spectral clustering in distributed systems, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33 (2011), pp. 568-586.
[2] Xinlei Chen and Deng Cai, Large scale spectral clustering with landmark-based representation, in AAAI, 2011, pp. 313-318.
[3] Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain, Approximate kernel k-means: Solution to large scale kernel clustering, in ACM SIGKDD KDD, 2011, pp. 895-903.
[4] Radha Chitta, Rong Jin, and Anil K. Jain, Efficient kernel clustering using random Fourier features, in IEEE ICDM, 2012, pp. 161-170.
[5] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, 51 (2008), pp. 107-113.
[6] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis, Weighted graph cuts without eigenvectors: A multilevel approach, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29 (2007), pp. 1944-1957.
[7] Ahmed Elgohary, Ahmed K. Farahat, Mohamed S. Kamel, and Fakhri Karray, Embed and conquer: Scalable embeddings for kernel k-means on MapReduce, CoRR, abs/1311.2334 (2013).
[8] Alina Ene, Sungjin Im, and Benjamin Moseley, Fast clustering using MapReduce, in ACM SIGKDD KDD, 2011, pp. 681-689.
[9] Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, and Mohamed S. Kamel, Distributed column subset selection on MapReduce, in Proceedings of the Thirteenth IEEE International Conference on Data Mining, 2013.
[10] Robson Leonardo Ferreira Cordeiro, Caetano Traina Junior, Agma Juci Machado Traina, Julio López, U. Kang, and Christos Faloutsos, Clustering very large multi-dimensional datasets with MapReduce, in ACM SIGKDD KDD, 2011.
[11] Fei Gao, Wael Abd-Almageed, and Mohamed Hefeeda, Distributed approximate spectral clustering for large-scale datasets, in HPDC, ACM, 2012, pp. 223-234.
[12] Piotr Indyk, Stable distributions, pseudorandom generators, embeddings and data stream computation, in Proceedings of the Symposium on Foundations of Computer Science, 2000.
[13] A. K. Jain, M. N. Murty, and P. J. Flynn, Data clustering: A review, ACM Computing Surveys, 31 (1999), pp. 264-323.
[14] Brian Kulis and Kristen Grauman, Kernelized locality-sensitive hashing, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34 (2012), pp. 1092-1104.
[15] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar, Ensemble Nyström method, in NIPS, 2009, pp. 1060-1068.
[16] Spiros Papadimitriou and Jimeng Sun, DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining, in Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, 2008, pp. 512-521.
[17] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, in NIPS, 2007, pp. 1177-1184.
[18] Ulrike von Luxburg, A tutorial on spectral clustering, Statistics and Computing, 17 (2007), pp. 395-416.
[19] Christopher Williams and Matthias Seeger, Using the Nyström method to speed up kernel machines, in NIPS, MIT Press, 2000, pp. 682-688.
[20] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou, Nyström method vs random Fourier features: A theoretical and empirical comparison, in NIPS, 2012, pp. 485-493.
