Scaling Clustering Algorithms for Massive Data Sets using Data Streams
Silvia Nittel
NCGIA
University of Maine
Orono, ME, USA
nittel@spatial.maine.edu

Kelvin T. Leung
Computer Science
University of California
Los Angeles, CA, USA
kelvin@cs.ucla.edu

Amy Braverman
Earth & Space Science Division
Jet Propulsion Laboratory
Pasadena, CA, USA
amy@jord.jpl.nasa.gov
Abstract
Computing data mining algorithms such as clustering techniques on massive geospatial data sets is still neither feasible nor efficient today. Massive data sets are continuously produced at a rate of over several TB/day. For the case of compressing such data sets, we demonstrate the necessity for clustering algorithms that are highly scalable with regard to data size and to the utilization of available computing resources in order to achieve an overall high performance computation. In this paper, we introduce a data stream-based approach to clustering, the partial/merge k-means algorithm. We implemented the partial/merge k-means as a set of data stream operators, which adapt to available computing resources such as volatile memory and processors by parallelizing and cloning operators, by computing k-means on data partitions that fit into memory, and by then merging the partial results. In our extensive analytical and experimental performance evaluation, we show that the partial/merge k-means outperforms a serial implementation by a large margin with regard to overall computation time, and achieves a significantly higher clustering quality.
1 Introduction
Computing data mining algorithms such as clustering techniques on massive data sets is still neither feasible nor efficient today. Massive data sets are continuously produced at a rate of over several TB/day, and are typically stored on tertiary memory devices. To cluster massive data sets or subsets thereof, overall execution time and scalability are important issues.

For instance, the NASA Earth Observation System's Information System (EOSDIS) collects several TB of satellite imagery data daily, and data subsets thereof are distributed to scientists. To facilitate improved data distribution and analysis, we substitute data sets with compressed counterparts. In our compression approach, we use a technique that is able to capture the high-order interaction between the attributes of measurements as well as their non-parametric distribution, and we assume a full dependence of attributes. To do so, we partition the data set into 1 degree x 1 degree grid cells, and compress each grid cell individually using multivariate histograms [6]. For the compression of a single global coverage of the earth for the MISR instrument [22], we need to compute histograms for 64,800 individual grid cells, each of which contains up to 100,000 data points that have between 10 and 100 attributes (see footnote 1). To achieve a dense representation within the histograms, we use non-equi-depth buckets so that the shapes, sizes, and number of buckets are able to adapt to the shape and complexity of the actual data in the high dimensional space. To compute the histograms for a grid cell, we use a clustering technique that produces the multivariate histograms. This example motivates the necessity for an approach to implementing clustering algorithms such as k-means that are highly scalable. They need to be scalable with regard to the overall data size, the number of data points within a single grid cell, the dimensionality of data points, and the utilization of available computing resources in order to achieve an overall high performance computation for massive data.

We focus on approaches that allow us to perform k-means in a highly scalable way in order to cluster massive data sets efficiently; with this viewpoint, we do not consider solutions that otherwise improve the k-means algorithm; the related work in the field can readily be applied to provide further algorithmic improvements.
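As a concrete illustration of this partitioning step, the following is a minimal sketch of how measurements could be binned into 1 degree x 1 degree cells before each cell is clustered and compressed independently. The function names and the (lat, lon, attributes) record layout are our illustrative assumptions, not the paper's actual code.

```python
import math
from collections import defaultdict

def grid_cell_id(lat, lon):
    # Map a (lat, lon) measurement to a 1 degree x 1 degree grid cell.
    # With lat in [-90, 90) and lon in [-180, 180), this yields
    # 180 * 360 = 64,800 distinct cells, matching the MISR example.
    return (int(math.floor(lat)), int(math.floor(lon)))

def partition_into_cells(points):
    # points: iterable of (lat, lon, attributes) tuples, where
    # attributes is a vector of 10-100 measurement values.
    cells = defaultdict(list)
    for lat, lon, attributes in points:
        cells[grid_cell_id(lat, lon)].append(attributes)
    return cells  # each cell's points are then clustered independently
```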
1.1 Requirements from a Scalability and Performance Perspective
From a scalability and performance perspective, the following criteria are relevant when designing a k-means algorithm for massive data:
Footnote 1: Typically, a global coverage of measurements is collected, depending on the instrument, every 2 to 14 days.
[Figure 1. Swath of the MISR satellite instrument]
Handling large numbers of data points: For massive data sets, an algorithm should be highly scalable with regard to the number and dimensionality of data points that need to be clustered. This includes that it should scale automatically from small to massive data sets.

Overall efficient response time: The algorithm should adapt automatically to available resources such as computational resources and memory, and maximize the utilization of resources in a greedy way.

High quality clustering results: The results of clustering should provide a highly faithful representation of the original data, and capture all correlations between data points.

Easily interpretable results: Compressing massive data sets into equivalents that preserve the temporal-spatial structure should also represent data using descriptions that can easily be interpreted by an end user such as a climate researcher.
1.2 Contributions of this Paper
In this paper, we propose a scalable, parallel implementation of the k-means clustering algorithm based on the data stream paradigm. A single data stream-based operator consumes one or several data items from an incoming data stream, processes the data, and produces a stream of output data items, which are immediately consumed by the next data stream operator; thus, all data stream operators process data in a pipelined fashion. Using this paradigm for massive data sets, several restrictions can be observed:

1. Each data item from the original data set should be scanned only once to avoid data I/O,

2. An operator can store only a limited amount of state information in volatile memory, and

3. There is limited control over the order of arriving data items.

Nevertheless, data streaming allows us to exploit the inherent parallelism in the k-means algorithm. We introduce the partial/merge k-means algorithm, which processes the overall set of points in 'chunks' and 'merges' the results of the partial k-means steps into an overall cluster representation (see the sketch at the end of this section). The partial k-means and the merge k-means are implemented as data stream operators that are adaptable to available computing resources such as volatile memory and processors by parallelizing and cloning operators, and by computing k-means on partitions of data that fit into memory. In our analytical and extensive experimental tests, we compare the scalability, performance and clustering quality of a serial and a data stream-based k-means implementation. A serial implementation assumes that grid cells are computed serially, and it requires that all data points are present in memory. The experimental results suggest that our approach scales in an excellent way with regard to the memory bottleneck, and produces clusters of better quality than those generated by a serial k-means algorithm, especially for large datasets.

The remainder of this paper is structured as follows: Section 2 states the problem in more detail and discusses related work. Section 3 contains the detailed description of the partial/merge k-means, and Section 4 includes implementation details of our prototype. Section 5 contains our experimental results. We conclude this paper in Section 6 with a few remarks.
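As a forward-looking illustration of the partial/merge idea (the full operator design is in Section 3), here is a minimal sketch under our own assumptions: each memory-sized chunk is clustered independently, each partial centroid is kept together with its cluster's point count as a weight, and a weighted k-means over all partial centroids produces the overall representation. The fixed iteration count, the helper names, and the plain-list interface are simplifications of ours; the paper's operators are pipelined data stream operators.

```python
import random

def weighted_kmeans(points, weights, k, iters=50):
    # k-means where each point carries a weight; with unit weights this
    # is ordinary k-means, with member counts as weights it merges
    # partial centroids into an overall cluster representation.
    dim = len(points[0])
    centroids = random.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[j])))
            buckets[j].append((p, w))
        for j, members in enumerate(buckets):
            total = sum(w for _, w in members)
            if total > 0:  # empty clusters keep their previous centroid
                centroids[j] = tuple(
                    sum(w * p[d] for p, w in members) / total
                    for d in range(dim))
    return centroids, [sum(w for _, w in b) for b in buckets]

def partial_merge_kmeans(chunks, k):
    # Phase 1 (partial): cluster each memory-sized chunk independently.
    partial_centroids, partial_weights = [], []
    for chunk in chunks:
        cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk), k)
        partial_centroids.extend(cents)
        partial_weights.extend(wts)
    # Phase 2 (merge): weighted k-means over all partial centroids.
    return weighted_kmeans(partial_centroids, partial_weights, k)
```

The design point this sketch captures is that only one chunk's points plus the accumulated partial centroids ever need to be resident in memory at once.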
2 Clustering Massive Data Sets
Several methods of data compression in sparse high dimensional temporal-spatial spaces are based on clustering. Here, a data set is subdivided into temporal-spatial grid cells, and each cell is compressed separately. A grid cell can contain up to 100,000 n-dimensional data points. During clustering, a grid cell's data points are partitioned into several disjunct clusters such that the elements of a cluster are similar, and the elements of disjunct clusters are dissimilar. Each cluster is represented by a specific element, called the cluster centroid. For non-compression applications, the goal is to partition a data set into clusters; for compression, we are interested in representing the overall data set via the cluster centroids.
Two strategies can be used to identify clusters: a) finding densely populated areas in a data set, and dissecting the data space according to the dense areas (clusters) (density estimation-based methods), or b) using an algorithm that attempts to find the clusters iteratively (k-means). In our work, we will focus on the k-means approach. In detail, the k-means algorithm can be formalized as follows. K-means is an iterative algorithm that categorizes a set of data vectors with metric attributes in the following way: given a set of $N$ $D$-dimensional vectors, form $k$ disjoint non-empty subsets $\{C_1, C_2, \ldots, C_k\}$ such that each vector $v_j \in C_i$ is closer to $mean(C_i)$ than to any other mean. Intuitively, the k-means algorithm can be explained with the following steps:

1. Initialization: Select a set of $k$ initial cluster centroids randomly, i.e. $m_j$, $1 \leq j \leq k$.

2. Distance Calculation: For each data point $v_i$, $1 \leq i \leq n$, compute its Euclidean distance to each centroid $m_j$, $1 \leq j \leq k$, and find the closest cluster centroid. The Euclidean distance from a vector $v_j$ to a cluster centroid $c_k$ is defined as $dis(c_k, v_j) = (\sum_{d=1}^{D} (c_{kd} - v_{jd})^2)^{1/2}$.

3. Centroid Recalculation: For each $1 \leq j \leq k$, compute the actual mean of the cluster $C_j$, which is defined as $\mu_j = \frac{1}{|C_j|} \sum_{v \in C_j} v$; the cluster centroid $m_j$ 'jumps' to the recalculated actual mean of the cluster, and defines the new centroid.

4. Convergence Condition: Repeat (2) to (3) until the convergence criterion is met. The convergence criterion is defined as the difference between the mean square error ($MSE$) in the previous clustering iteration $(n-1)$ and the mean square error in the current clustering iteration $(n)$. In particular, we choose $(MSE(n-1) - MSE(n)) \leq 1 \times 10^{-9}$ (refer to our experiments in Section 5.2).

The quality of the clustering process is indicated by the error function $E$, which is defined as $E = \sum_{k=1}^{K} \sum_{v \in C_k} \|\mu_k - v\|^2$, where $K$ is the total number of clusters. The k-means algorithm iteratively minimizes this function. The quality of the clustering depends on the selection of $k$; also, the k-means algorithm is sensitive to the initial random choice of $k$ seeds from the existing data points. In the scope of this paper, we assume that we are able to make an appropriate choice of $k$; however, we will vary the selection of initial random seeds.

There are several improvements for step 2 that allow us to limit the number of points that have to be re-sorted; however, since this is not relevant with regard to the scope of the paper, we will not consider it further.
2.1 Problem Statement
Clustering temporal-spatial data in high dimensional spaces using k-means is expensive with regard to both computational costs and memory requirements. Table 1 depicts the symbols used in the complexity analysis of the algorithms.
N  Number of data points
I  Number of iterations to converge
K  Total number of centroids
G  Total number of grid cells
R  Total number of experiment runs (with different initial k seeds)
p  Total number of chunks/partitions used in partial/merge k-means

Table 1. Symbols used in complexity analysis of the algorithms.
When computing k-means via a serial algorithm, i.e. scanning a grid cell j at a time, compressing it, and then scanning the next grid cell, all data points belonging to one grid cell have to be kept in memory. The algorithm uses I iterations to converge, and it is run R times with R different sets of randomly chosen initial seeds. In this case, the memory complexity is O(N), and the time complexity is O(GRIKN), whereby G is the number of overall grid cells. Here, both the memory and the computational resources are bottlenecks for the serial k-means.

Two aspects of the memory bottleneck need to be considered: the volatile memory that is available via the virtual memory management, and the actual memory available via RAM. From a database perspective, control over RAM is essential to control any undesired paging effects. If the relevant data points do not fit into RAM, then to avoid uncontrolled paging effects by the underlying operating system, the data set has to be broken up and clustered incrementally. In the database literature, several approaches deal with the problem of large data sizes ([2, 5, 14, 18, 25]), a problem that is not considered in k-means algorithms used in statistics, machine learning, and pattern recognition. We will refer to this work in more detail in the related work section 2.2.
Parallel implementations of k-means deal with the bottleneck of computational resources when clustering large amounts of high dimensional data. A group of processors is utilized to solve a single computational problem. Several ways of parallelizing k-means using either massively parallel processor machines or networks of PCs can be considered (see footnote 2). As shown in Figure 2, Method A is a naive way of parallelizing k-means: it assigns the clustering of one grid

Footnote 2: For price/performance reasons, we consider shared-nothing or shared-disk environments as available in networks of PCs.
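On the reading that Method A assigns the clustering of one grid cell to one processor, a minimal sketch of such grid-cell-level parallelism (our illustration, reusing the serial_kmeans sketch from Section 2; the worker count and task shape are assumptions) might look as follows.

```python
from multiprocessing import Pool

def cluster_cell(task):
    # One task = one grid cell, clustered independently of all others
    # (grid-cell-level parallelism; no parallelism within a cell).
    cell_id, vectors, k = task
    centroids, _ = serial_kmeans(vectors, k)
    return cell_id, centroids

def cluster_all_cells(cells, k, workers=8):
    # cells: dict mapping a grid cell id to that cell's data vectors.
    tasks = [(cid, vecs, k) for cid, vecs in cells.items()]
    with Pool(processes=workers) as pool:
        return dict(pool.map(cluster_cell, tasks))
```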
