Introduction
2. Distributed Clustering:
Distributed clustering assumes that the objects to be clustered reside on different sites. Instead of transmitting all objects to a central site (also denoted as the server), where standard clustering algorithms could be applied to analyze the data, the data are clustered independently on the different local sites (also denoted as clients). In a subsequent step, the central site tries to establish a global clustering based on the local models, i.e. the representatives. This is a difficult step, as there might exist dependencies between objects located on different sites which are not taken into consideration when the local models are created. In contrast to a central clustering of the complete dataset, the central clustering of the local models can be carried out much faster.
Distributed clustering is carried out on two different levels, i.e. the local level and the global level (cf. Figure 1). On the local level, all sites carry out a clustering independently of each other. After having completed the clustering, a local model is determined which should reflect an optimum trade-off between complexity and accuracy. Our proposed local models consist of a set of representatives for each locally found cluster. Each representative is a concrete object from the objects stored on the local site. Furthermore, we augment each representative with a suitable ε-range value. Thus, a representative is a good approximation for all objects residing on the corresponding local site which are contained in the ε-range around this representative. Next, each local model is transferred to a central site, where the local models are merged in order to form a global model. The global model is created by analyzing the local representatives. This analysis is similar to a new clustering of the representatives with suitable global clustering parameters. A global cluster-identifier is assigned to each local representative. The resulting global clustering is sent back to all local sites. If a local object belongs to the ε-neighborhood of a global representative, the cluster-identifier of that representative is assigned to the local object. Thus, each site ends up with the same information as if its data had been clustered at a central site together with the data of all the other sites.
To sum up, distributed clustering consists of four different steps (cf. Figure 1):
• Local clustering
• Determination of a local model
• Determination of a global model, which is based on all local models
• Updating of all local models
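The four steps can be illustrated with a small self-contained sketch. This is not the authors' implementation: the local clusterer (a toy k-means with farthest-point seeding), the single-link merging of representatives, and the nearest-representative relabeling are simplifying assumptions made purely for illustration.

```python
import numpy as np

def local_model(points, k):
    """Steps 1+2: cluster one site's data and build its local model."""
    # farthest-point seeding, then a few Lloyd iterations (toy clusterer)
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(20):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    # the local model: one concrete representative object per cluster,
    # augmented with its eps-range (distance to the farthest member)
    model = []
    for j in range(k):
        members = points[labels == j]
        if len(members) == 0:
            continue
        rep = members[np.linalg.norm(members - centers[j], axis=1).argmin()]
        model.append((rep, float(np.linalg.norm(members - rep, axis=1).max())))
    return model

def global_model(local_models, merge_dist):
    """Step 3: merge all local representatives at the central site;
    representatives within merge_dist share a global cluster id."""
    reps = [rep for model in local_models for (rep, _) in model]
    ids = list(range(len(reps)))
    for a in range(len(reps)):
        for b in range(a + 1, len(reps)):
            if np.linalg.norm(reps[a] - reps[b]) <= merge_dist:
                old, new = ids[b], ids[a]
                ids = [new if x == old else x for x in ids]
    return list(zip(reps, ids))

def relabel_site(points, global_reps):
    """Step 4: each local object takes the id of its nearest global
    representative (the text assigns via eps-neighborhoods; nearest-rep
    is a simplification)."""
    return [min(global_reps, key=lambda ri: np.linalg.norm(p - ri[0]))[1]
            for p in points]
```

With two sites each holding samples from the same two well-separated groups, the central merge collapses the four local representatives into two global clusters, and every site can relabel its objects consistently without ever shipping raw data.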
Figure 1: Distributed Clustering
Peer-to-Peer (P2P):
3. Distributed Algorithm
3.2 Density methods (DBSCAN: Density-Based Spatial Clustering of Applications with Noise):
In this section, we present the algorithm DBSCAN (Density Based Spatial Clustering of
Applications with Noise) which is designed to discover the clusters and the noise in a spatial
database according to definitions 5 and 6. Ideally, we would have to know the appropriate
parameters Eps and MinPts of each cluster and at least one point from the respective cluster.
Then, we could retrieve all points that are density-reachable from the given point using the correct parameters. But there is no easy way to get this information in advance for all clusters of the database. However, there is a simple and effective heuristic to determine the parameters Eps and MinPts of the "thinnest", i.e. least dense, cluster in the database. Therefore, DBSCAN uses global values for Eps and MinPts, i.e. the same values for all clusters. The density parameters of the "thinnest" cluster are good candidates for these global parameter values, specifying the lowest density which is not considered to be noise.
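One common concrete form of this heuristic (sketched here under the assumption that it matches the sorted k-dist graph of the cited section 4.2) computes, for every point, the distance to its k-th nearest neighbor and sorts these values in descending order; the value at the first pronounced "valley" of that curve is a candidate Eps, with MinPts = k:

```python
import numpy as np

def sorted_k_distances(points, k=4):
    """Distance from each point to its k-th nearest neighbor,
    sorted in descending order (the 'sorted k-dist graph').
    Points left of the knee are noise candidates; the k-dist
    value at the knee is a candidate Eps."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)          # row i ascending; column 0 is the point itself
    return np.sort(d[:, k])[::-1]
```

Plotted, this curve is inspected by eye: choosing Eps at the knee keeps the "thinnest" genuine cluster below the noise threshold, since its points still have at least k neighbors within Eps.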
To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. Eps and MinPts. If p is a core point, this procedure yields a cluster wrt. Eps and MinPts. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database. Since we use global values for Eps and MinPts, DBSCAN may merge two clusters according to definition 5 into one cluster, if two clusters of different density are "close" to each other. Let the distance between two sets of points S1 and S2 be defined as dist(S1, S2) = min{ dist(p, q) | p ∈ S1, q ∈ S2 }. Then, two sets of points having at least the density of the thinnest cluster will be separated from each other only if the distance between the two sets is larger than Eps. Consequently, a recursive call of DBSCAN may be necessary for the detected clusters with a higher value for MinPts. This is, however, no disadvantage, because the recursive application of DBSCAN yields an elegant and very efficient basic algorithm. Furthermore, the recursive clustering of the points of a cluster is only necessary under conditions that can be easily detected. In the following, we present a basic version of DBSCAN, omitting details of data types and the generation of additional information about clusters:
DBSCAN (SetOfPoints, Eps, MinPts)
// SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  FOR i FROM 1 TO SetOfPoints.size DO
    Point := SetOfPoints.get(i);
    IF Point.ClId = UNCLASSIFIED THEN
      IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
        ClusterId := nextId(ClusterId)
      END IF
    END IF
  END FOR
END; // DBSCAN
SetOfPoints is either the whole database or a discovered cluster from a previous run. Eps and MinPts are the global density parameters, determined either manually or according to the heuristics presented in section 4.2. The function SetOfPoints.get(i) returns the i-th element of SetOfPoints. The most important function used by DBSCAN is ExpandCluster, which is presented below:
ExpandCluster (SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  IF seeds.size < MinPts THEN // no core point
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  ELSE // all points in seeds are density-reachable from Point
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    WHILE seeds <> Empty DO
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      IF result.size >= MinPts THEN
        FOR i FROM 1 TO result.size DO
          resultP := result.get(i);
          IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
            IF resultP.ClId = UNCLASSIFIED THEN
              seeds.append(resultP);
            END IF;
            SetOfPoints.changeClId(resultP, ClId);
          END IF;
        END FOR;
      END IF;
      seeds.delete(currentP);
    END WHILE;
    RETURN True;
  END IF
END; // ExpandCluster
A call of SetOfPoints.regionQuery(Point, Eps) returns the Eps-neighborhood of Point in SetOfPoints as a list of points. Region queries can be supported efficiently by spatial access methods such as R*-trees (Beckmann et al. 1990), which are assumed to be available in an SDBS for efficient processing of several types of spatial queries (Brinkhoff et al. 1994). The height of an R*-tree is O(log n) for a database of n points in the worst case, and a query with a "small" query region has to traverse only a limited number of paths in the R*-tree. Since the Eps-neighborhoods are expected to be small compared to the size of the whole data space, the average run time complexity of a single region query is O(log n). For each of the points of the database, we have at most one region query. Thus, the average run time complexity of DBSCAN is O(n * log n).
The ClId (cluster id) of points which have been marked as NOISE may be changed later, if they are density-reachable from some other point of the database. This happens for border points of a cluster. Those points are not added to the seeds list, because we already know that a point with a ClId of NOISE is not a core point. Adding those points to seeds would only result in additional region queries which would yield no new answers. If two clusters C1 and C2 are very close to each other, it might happen that some point p belongs to both C1 and C2. Then p must be a border point in both clusters, because otherwise C1 would be equal to C2, since we use global parameters. In this case, point p will be assigned to the cluster discovered first. Apart from these rare situations, the result of DBSCAN is independent of the order in which the points of the database are visited, due to the corresponding lemma.
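The control flow above can be sketched in a few lines of Python. This is an illustrative brute-force version: regionQuery is a linear scan rather than an R*-tree query, so it runs in O(n²) instead of the O(n log n) average discussed above, but the labels follow the UNCLASSIFIED/NOISE convention of the pseudocode.

```python
import numpy as np

NOISE, UNCLASSIFIED = -1, 0

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force)."""
    return np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0]

def dbscan(points, eps, min_pts):
    """Return a cluster id per point; NOISE (-1) for noise points."""
    labels = np.full(len(points), UNCLASSIFIED)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNCLASSIFIED:
            continue
        seeds = list(region_query(points, i, eps))
        if len(seeds) < min_pts:            # i is not a core point
            labels[i] = NOISE               # may be relabeled as border later
            continue
        cluster_id += 1
        labels[seeds] = cluster_id          # all seeds density-reachable from i
        while seeds:
            j = seeds.pop()
            result = region_query(points, j, eps)
            if len(result) >= min_pts:      # j is a core point: expand
                for r in result:
                    if labels[r] in (UNCLASSIFIED, NOISE):
                        if labels[r] == UNCLASSIFIED:
                            seeds.append(r)  # NOISE points are never core,
                        labels[r] = cluster_id  # so they are not re-expanded
    return labels
```

Note the border-point behavior described above: a point first marked NOISE is relabeled when reached from a core point, but it is not appended to seeds, avoiding region queries that cannot yield new answers.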
In this section, the distributed k-means algorithm will be proposed. For this purpose, we first briefly review the centralized k-means algorithm, which is well established in the existing literature.
3.3.2.1 Introduction to the Centralized k-Means Algorithm
Given a set of observations {x1, x2, . . . , xn}, where each observation is a d-dimensional real-valued vector, the k-means algorithm [7] aims to partition the n observations into k (≤ n) sets S = {S1, S2, . . . , Sk} so as to minimize the within-cluster sum of squares (WCSS) function. In other words, its objective is to find

$$\underset{S}{\arg\min} \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^2,$$

where cj is the centroid of the observations in Sj.
During the assignment step, each xi is assigned to exactly one cluster, namely the one with the closest centroid, even if it could equally be assigned to two or more of them. For fixed centroids, this step minimizes the WCSS function.
During the update step, the centroid, say cj(T + 1), of the observations in the new cluster Sj(T) is computed as the arithmetic mean

$$c_j(T+1) = \frac{1}{|S_j(T)|} \sum_{x_i \in S_j(T)} x_i .$$

Since the arithmetic mean is a least-squares estimator, this step also minimizes the WCSS function.
The algorithm converges if the centroids no longer change.
The k-means algorithm can converge to a (local) optimum, while there is no guarantee for it to
converge to the global optimum [7]. Since for a given k, the result of the above clustering
algorithm depends solely on the initial centroids, a common practice in clustering the sensor
observations is to execute the algorithm several times for different initial centroids and then
select the best solution.
The computational complexity of the k-means algorithm is O(nkdM) [7], where n is the number
of the d-dimensional vectors, k is the number of clusters, and M is the number of iterations
before reaching convergence.
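The alternating assignment/update iteration described above can be sketched as follows (a minimal Lloyd-style illustration with plain random initialization, which is what the distributed k-means++ procedure of the next subsection improves on):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Centralized k-means (Lloyd iteration): returns (centroids, labels, wcss)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # assignment step: each observation joins its nearest centroid
        dist = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: each centroid becomes the arithmetic mean of its cluster
        new_c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else c[j] for j in range(k)])
        if np.allclose(new_c, c):   # converged: centroids no longer change
            break
        c = new_c
    wcss = float(((X - c[labels]) ** 2).sum())  # within-cluster sum of squares
    return c, labels, wcss
```

Because the result depends solely on the initial centroids for a given k, the common practice mentioned above is to run the function for several seeds and keep the solution with the smallest WCSS.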
3.3.2.2 Choosing Initial Centroids Using the Distributed k-Means++ Algorithm:
The choice of the initial centroids is the key to making the k-means algorithm work well. In the k-means algorithm, the partition result depends only on the initial centroids for a given k, and so does the distributed one. Thus, an effective distributed method for choosing the initial centroids is important. Here we provide a distributed implementation of the k-means++ algorithm [24], a centralized algorithm for finding the initial centroids for the k-means algorithm. It is noted that k-means++ generally outperforms k-means in terms of both accuracy and speed [24].
Let D(x) denote the distance from observation x to the closest centroid that has already been chosen.
The k-means++ algorithm [24] is executed as follows.
1) Choose randomly an observation from {x1, x2, . . . , xn} as the first centroid c1.
2) Take a new centroid cj, choosing an observation x with probability proportional to D(x)², i.e. D(x)² / Σx′ D(x′)².