
CHAPTER 7

A GRID CLUSTERING ALGORITHM

7.1 Introduction

Grid-based methods are widely used in preference to the algorithms discussed in the
previous chapters because of their rapid clustering results. In this chapter, a non-
parametric grid-based clustering algorithm is presented using the concepts of boundary
grids and the local outlier factor [31].

7.2 Algorithm Based on Local Outlier Factor

Let D = {x1, x2,…, xn} be the given set of n data points embedded in an m-dimensional
space. Each xi is composed of m attributes, i.e., xi = (xi1, xi2,…, xim). Initially, a single
grid is used to represent all the given n points. The extrema of this grid are taken as
the minimum and maximum attribute values in each dimension [129]. Let Min(i) =
min{x1i, x2i,..., xni} and Max(i) = max{x1i, x2i,..., xni}. Then the initial grid G(n;m) is
represented as G(n;m) = [Min(1), Max(1)] × [Min(2), Max(2)] × … × [Min(m), Max(m)].
Initially, the value of k (the number of clusters) is one. Now, we partition the grid
G(n;m) into two equal-volume grids G1(n1;m) and G2(n2;m) along a uniformly selected
dimension, and the data points of G(n;m) are distributed between these two grids. Here,
we define the grids (cells) which contain at least one of the given points as non-empty
grids; all the other grids remain empty grids. The empty grids which share a common
vertex with non-empty grids are known as boundary grids. It can easily be noted that
each cluster is surrounded by boundary grids. After each round of partitioning, it is
necessary to check for the presence of new clusters. Here, a cluster is defined as the
collection of points in the non-empty grids which are connected through common
vertices in the grid structure. In the next round of partitioning, the two grids G1(n1;m)
and G2(n2;m) are partitioned into four equal-volume grids G1,1(n3;m), G1,2(n4;m),
G2,1(n5;m) and G2,2(n6;m) along another chosen dimension, and the points are
redistributed to the new grids. In this way, all the grids of the earlier iteration are
bisected, and the partitioning process is continued until a termination condition is met.

To identify the optimal grid structure, we use the boundary grids. As discussed above,
a cluster is defined as the collection of the points of directly connected non-empty
grids, and each cluster is surrounded by boundary grids. We also define the volume of a
cluster as the sum of the volumes of all the non-empty grids of that cluster. We have
observed that the volume of a cluster strictly decreases as the partitioning process
continues, while the number of surrounding empty grids (boundary grids) increases
significantly. This is clearly depicted in Figures 7.1(a-e). If the required clusters are
already formed and we continue the partitioning process, a few clusters split into
multiple clusters, as shown in Figure 7.1(e). Some of these clusters are surrounded by a
very small number of boundary grids compared to the clusters of the previous iteration.
For instance, five clusters surrounded by very few boundary grids are shown in Figure
7.1(e). This case therefore indicates that the clusters of the previous iteration are the
desired ones and that the corresponding grid size is optimal compared to the grid sizes
of all the other iterations.
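The partitioning and boundary-grid bookkeeping described above can be illustrated with the following Python sketch; the helper names assign_cells and boundary_grids are illustrative assumptions rather than part of the formal description. The sketch maps the points to equal-volume cells after a given number of bisections per dimension and collects the empty cells that share a vertex with a non-empty cell.

    import numpy as np
    from itertools import product

    def assign_cells(X, splits):
        """Map each point to its cell index after splits[d] bisections of dimension d."""
        mins, maxs = X.min(axis=0), X.max(axis=0)
        sizes = 2 ** np.asarray(splits)              # number of cells per dimension
        widths = (maxs - mins) / sizes
        widths[widths == 0] = 1.0                    # guard against degenerate dimensions
        idx = np.floor((X - mins) / widths).astype(int)
        return np.minimum(idx, sizes - 1)            # points on the maximum face stay inside

    def boundary_grids(cell_idx, splits):
        """Empty cells that share at least one vertex with a non-empty cell."""
        sizes = 2 ** np.asarray(splits)
        non_empty = {tuple(c) for c in cell_idx}
        offsets = list(product((-1, 0, 1), repeat=len(sizes)))
        boundary = set()
        for cell in non_empty:
            for off in offsets:
                nb = tuple(c + o for c, o in zip(cell, off))
                if all(0 <= v < s for v, s in zip(nb, sizes)) and nb not in non_empty:
                    boundary.add(nb)
        return boundary

In the algorithm the same vertex-adjacency test is applied per cluster, so that only the empty cells adjacent to a particular cluster's non-empty cells are counted as that cluster's boundary grids; tracking these counts after every round of bisection yields the quantity whose sudden drop signals the termination stage discussed above.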


Figure 7.1: Clusters produced by grouping the points in the grids with common
vertices: (a) one cluster; (b) five clusters; (c) five clusters; (d) more than five clusters;
(e) a cluster partitioned into several sub-clusters with fewer boundary grids.


7.2.1 Problem of Outliers

The above idea is sensitive to outlier points in the given data set. As the outliers are
far from the clusters, these points form separate clusters during the partitioning
process. This happens because the outlier region is separated from the other clusters as
the number of grids increases (or the grid size decreases). As per our definition of a
cluster, the outlier points are also surrounded by boundary grids. Here, the number of
boundary grids is significantly smaller than that of the other clusters, which falsely
indicates the termination stage as per the above discussion. Therefore, the clusters may
not be fully formed, as the partitioning process is terminated prematurely under the
effect of outliers. Hence, we need a measure to identify outliers so that the partitioning
process can continue smoothly. For this purpose, we have used the local outlier factor
(LOF) proposed by Breunig et al. [31] to compute the degree to which a point is an
outlier.

Initially, the local neighborhood of a point x ∈ S with respect to the minimum
points threshold mp is defined as follows:

N(x, mp) = {y ∈ S : d(x, y) ≤ d(x, xmp)}

where xmp is the mp-th nearest neighbor of x. Thus N(x, mp) contains at least
mp points. The density of a point x ∈ S is defined as follows.

density(x, mp) = |N(x, mp)| / ( Σ_{y ∈ N(x, mp)} d(x, y) )                    (7.1)

If the distances between x and its neighboring points are small, then the
density of x is high. Then the average relative density (ard) of x is calculated as
below.

ard(x, mp) = density(x, mp) / ( (1 / |N(x, mp)|) Σ_{y ∈ N(x, mp)} density(y, mp) )          (7.2)

Now, the local outlier factor (LOF) of x is defined as the inverse of the
average relative density of x, i.e.,

LOF(x, mp) = 1 / ard(x, mp)                    (7.3)

If a point belongs to any cluster, then its LOF value is close to one, because the
density of that point and the densities of its neighboring points are roughly equal.
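A minimal sketch of Equations (7.1)-(7.3) is given below, assuming Euclidean distances and a brute-force neighbor search; the function names are illustrative only. This is adequate here because the LOF is evaluated only for the few points of a suspicious cluster.

    import numpy as np

    def knn_neighborhood(X, i, mp):
        """Indices of N(x_i, mp): points within the distance to the mp-th nearest neighbor."""
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                # exclude the point itself
        r = np.sort(d)[mp - 1]                       # distance to the mp-th nearest neighbor
        return np.where(d <= r)[0], d

    def density(X, i, mp):
        """Equation (7.1): |N(x, mp)| divided by the sum of distances to the neighbors."""
        nbrs, d = knn_neighborhood(X, i, mp)
        return len(nbrs) / d[nbrs].sum()

    def lof(X, i, mp):
        """Equations (7.2) and (7.3): inverse of the average relative density of x_i."""
        nbrs, _ = knn_neighborhood(X, i, mp)
        ard = density(X, i, mp) / (sum(density(X, j, mp) for j in nbrs) / len(nbrs))
        return 1.0 / ard

A value of lof(X, i, mp) close to one then marks the point as a cluster member, while clearly larger values flag it as an outlier.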

If any cluster has significantly fewer boundary grids in any iteration of partitioning,
we compute the LOF value for all the points of that cluster using the above definition.
If any of these points is an outlier, we continue the partitioning process; otherwise,
the process is terminated. The pseudocode of this algorithm is provided below.

Algorithm OPT-GRID (S)

Input: A set S of n data points;


Output: A set of clusters C1, C2,…, Ck.
Functions and variables used:
Grid (n; m): A function to find the initial grid structure of the given set S of n points
with dimension m.
Partition (G): A function to partition all the grids into two equal volume new grids.
Connected (): A function to find the clusters from the grid structure using the
common vertices in the grid structure.
LOF (x): A function to compute the local outlier factor.
EGj: Empty grids.
NEj: Non-empty grids.
BGj: Boundary grids.

Step 1: Call Grid (n; m);
Step 2: p ← 1;
Step 3: Call Partition (G) to find the equal-volume grids G1 and G2.
Step 4: Call Connected () to produce the clusters (say, C1, C2,…, Cl for some l) by
        grouping points in the grids which are connected by the common vertices in
        the grid structure.
Step 5: Find the number of boundary grids BGj of all the clusters C1, C2,…, Cl;
Step 6: If (p > 1) then
            If the number of boundary grids of any cluster C of t points is
            less than 20% of the boundary grids of any cluster in the (p-1)th
            iteration then
                Go to step 7.
            Else
            {
                p ← p + 1;
                Go to step 3 to call the Partition function for all the grids.
            }
        Else
        {
            p ← p + 1;
            Go to step 3 to call the Partition function for all the grids.
        }
Step 7: Call LOF (xq) ∀ xq ∈ C, 1 ≤ q ≤ t;
Step 8: If LOF (xq) ≈ 1 ∀ xq ∈ C then Go to step 9;
        Else /* Sign of outliers */
            p ← p + 1;
            Go to step 3 to continue the partitioning process for all the grids.
Step 9: Output the clusters C1, C2,…, Ck with respect to the grid size in the (p-1)th
        iteration.
Step 10: Exit ();
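Steps 6-8 above can be condensed into the following sketch of the termination test. It assumes each current cluster is summarized by its boundary-grid count and the indices of its points, reuses the lof function sketched earlier, interprets "less than 20% of the boundary grids of any cluster of the previous iteration" against the smallest previous count, and treats "LOF ≈ 1" as LOF ≤ lof_tol for an assumed tolerance.

    def should_terminate(prev_boundary_counts, clusters, X, mp, lof_tol=1.2):
        """clusters: list of (boundary_count, point_indices) pairs for the current iteration."""
        for boundary_count, points in clusters:
            if boundary_count < 0.2 * min(prev_boundary_counts):
                # Suspiciously few boundary grids: a genuine optimum or an outlier group?
                if all(lof(X, i, mp) <= lof_tol for i in points):   # lof from the earlier sketch
                    return True      # every point looks like a cluster member: stop partitioning
        return False                 # outliers detected or no sharp drop: keep partitioning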

Function Grid (n; m)
{
    Min (y) = min {x1y, x2y,..., xny}, 1 ≤ y ≤ m;
    Max (y) = max {x1y, x2y,..., xny}, 1 ≤ y ≤ m;
    G(n; m) = [Min(1), Max(1)] × [Min(2), Max(2)] × … × [Min(m), Max(m)],
    i.e., G(n; m) = ∏_{1 ≤ y ≤ m} [Min(y), Max(y)];
}
Connected ( )
{
    Step 1: l ← 1;
    Step 2: Start with any random grid G.
    Step 3: If G is non-visited, then
                mark it as visited and add all the points of G to Cl,
                i.e., Cl ← Cl ∪ {points in the grid G};
            Else Go to step 2.
    Step 4: Find the non-visited grids Gr (for some r) that share any common
            vertex with G.
    Step 5: Add the points of all such Gr to the cluster Cl and mark each Gr as visited,
            i.e., Cl ← Cl ∪ {points in the grids Gr};
    Step 6: Repeat steps 4 and 5 for all the grids Gr until no new grid is identified.
    Step 7: If (|C1 ∪ C2 ∪ … ∪ Cl| ≠ n) then
            {
                l ← l + 1;
                Go to step 2 to restart the process.
            }
            Else
                Return (C1, C2,…, Cl);
}
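The Connected () routine is essentially a connected-components search over the non-empty cells, two cells being adjacent whenever they share a common vertex, i.e. their index vectors differ by at most one in every dimension. One possible realization, reusing the cell indices produced by the assign_cells sketch above (the function name is again an assumption), is the following.

    from collections import defaultdict, deque
    from itertools import product

    def connected(cell_idx):
        """Group point indices into clusters of vertex-connected non-empty cells."""
        cell_points = defaultdict(list)              # cell index -> points it contains
        for p, c in enumerate(map(tuple, cell_idx)):
            cell_points[c].append(p)
        dims = len(next(iter(cell_points)))
        offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]
        clusters, visited = [], set()
        for start in cell_points:
            if start in visited:
                continue
            queue, members = deque([start]), []
            visited.add(start)
            while queue:                             # breadth-first search over adjacent cells
                cell = queue.popleft()
                members.extend(cell_points[cell])
                for off in offsets:
                    nb = tuple(c + o for c, o in zip(cell, off))
                    if nb in cell_points and nb not in visited:
                        visited.add(nb)
                        queue.append(nb)
            clusters.append(members)
        return clusters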

Time Complexity: The proposed algorithm constructs the grid structure of the given
data points a finite number of times (say p) with respect to the uniformly chosen
dimensions. This task requires O(pn) time, as the partitioning process in an iteration
requires linear time. The local outlier factor (LOF) is computed only for the points (if
any exist) of a cluster suspected to represent outliers, and for non-outlier points only
in the last iteration of the algorithm. Hence, the LOF is computed a very small number
of times (say h) compared to the number of given data points (n). This requires O(hn)
time. It can easily be seen that both p and h are very small compared to n. Therefore,
the overall time complexity of the proposed algorithm can be considered linear in n.

7.2.2 Experimental Analysis

We performed extensive experiments with the proposed algorithm on many synthetic and
biological data sets. To demonstrate the efficiency of the proposed method, we compared
the experimental results with those of the existing clustering techniques, namely K-means
[26], IGDCA [145] and GGCA [129]. We used the normalized information gain (NIG) to
evaluate the quality of the clusters quantitatively. The NIG is defined as follows.

Normalized Information Gain: Normalized Information Gain [91] is a measure of
the quality of clusters based on the information gain [114]. This measure has been
shown to be very effective for evaluating data classification; hence, it is extended here
to evaluate clusters of supervised data. NIG is expressed in terms of the total and
weighted entropies defined as follows. Suppose there are L classes and each of the
given n data points belongs to exactly one of the L classes. Let cl denote the number of
data points of class l, for l = 1, 2,…, L, so that Σ_{l=1}^{L} cl = n. Then, the Total
Entropy (ENTotal), which is nothing but the average information per point in the data
set, is defined as follows.

ENTotal = − Σ_{l=1}^{L} (cl / n) log2(cl / n)                    (7.4)

Assume that the total number of clusters is K and that the kth cluster contains nk
points, of which ckl belong to class l, for l = 1, 2,…, L. Then the kth cluster
entropy (ENk) is given by

ENk = − Σ_{l=1}^{L} (ckl / nk) log2(ckl / nk)                    (7.5)

Note that if any cluster has points of only one class, then ENk is obviously zero.
Now, based on the kth cluster entropy, the weighted entropy (wEN), which is nothing
but the average information per point over the clusters, is calculated as follows.

wEN = Σ_{k=1}^{K} (nk / n) ENk                    (7.6)

If all the clusters are homogeneous, then wEN is zero, which indicates that each
cluster is composed of points of a single class. The Normalized Information Gain (NIG)
is calculated from the above quantities as follows.

NIG = (ENTotal − wEN) / ENTotal                    (7.7)

It is important to note that NIG is zero if no information is obtained, and NIG is one
if the total information is retrieved by the clustering technique. Therefore, NIG
should be close to 1 for good-quality clusters.
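For reference, Equations (7.4)-(7.7) translate into the short sketch below, which takes the true class labels and the produced cluster labels of the n points; the function names are illustrative.

    import math
    from collections import Counter

    def entropy(labels):
        """Average information per point: -sum of (c/n) log2 (c/n) over the label counts."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def nig(classes, clusters):
        """Normalized information gain, Equations (7.4)-(7.7)."""
        n = len(classes)
        en_total = entropy(classes)                              # Eq. (7.4)
        w_en = 0.0
        for k in set(clusters):
            members = [c for c, g in zip(classes, clusters) if g == k]
            w_en += (len(members) / n) * entropy(members)        # Eqs. (7.5) and (7.6)
        return (en_total - w_en) / en_total                      # Eq. (7.7)

For example, nig(classes, clusters) lies between 0 and 1, and values close to one correspond to the high-quality clusters reported in Tables 7.1 and 7.2.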

Initially, the proposed method, K-means, IGDCA and GGCA were applied to eight
two-dimensional synthetic data sets of various shapes and densities. The comparison
results in Table 7.1 below depict the efficiency of the proposed method over the
existing ones. The NIG values of the proposed method are larger than those of the
existing methods K-means, IGDCA and GGCA. Hence, as per the definition of NIG, the
proposed method outperforms the existing methods.


Table 7.1: Results of synthetic data using normalized information gain (NIG)

Data   Data Size   Cluster No.   K-means   IGDCA   GGCA   Proposed
DS1    1009        31            0.78      0.85    0.95   0.97
DS2    567         4             0.88      0.89    0.91   0.99
DS3    1967        77            0.67      0.78    0.89   0.93
DS4    837         6             0.82      0.89    0.79   0.94
DS5    1239        25            0.79      0.86    0.90   0.89
DS6    1800        56            0.62      0.81    0.84   0.89
DS7    613         9             0.92      0.92    0.85   0.97
DS8    798         15            0.74      0.77    0.98   0.98

Next, we evaluated the proposed and existing methods on eight biological data sets,
namely iris, wine, statlog heart, breast tissue, Pima Indians diabetes, cloud, blood
transfusion and yeast, taken from the UCI machine learning repository [32]. It can
easily be observed from the comparison results shown in Table 7.2 that the proposed
algorithm consistently produces better results than the algorithms K-means, IGDCA and
GGCA in terms of the normalized information gain (NIG).

Table 7.2: Results of biological data using normalized information gain (NIG)

Data                     No. of Attributes   Data Size   Cluster No.   K-means   IGDCA   GGCA   Proposed
Iris                     4                   150         3             0.89      0.90    0.96   0.98
Wine                     13                  178         3             0.91      0.92    0.95   0.99
Statlog Heart            13                  270         2             0.85      0.91    0.86   0.95
Breast Tissue            9                   106         2             0.82      0.78    0.91   0.90
Pima Indians Diabetes    8                   768         2             0.77      0.83    0.81   0.89
Cloud                    10                  1024        2             0.72      0.86    0.94   0.97
Blood Transfusion        5                   748         2             0.81      0.83    0.92   0.93
Yeast                    8                   1484        10            0.67      0.89    0.88   0.95


7.3 Conclusion

In this chapter, a new approach has been proposed to find the optimal grid size using
the cluster boundaries. The proposed method is non-parametric and runs with linear
computational complexity. It is made insensitive to outlier points by exploiting the
local outlier factor (LOF). The proposed scheme has been evaluated on various synthetic
and biological data using the normalized information gain (NIG). It has been shown that
the proposed method produces better results than K-means, IGDCA and GGCA.

CHAPTER 8
SUMMARY AND CONCLUSIONS

In this thesis, we have presented various clustering algorithms belonging to four basic
models of clustering, namely hierarchical, partitional, density-based and grid-based.
We have addressed some of the problems associated with the existing algorithms and
tried to resolve them through new or improved algorithms. We have shown the
experimental results of all the proposed algorithms on several benchmark synthetic and
biological data sets. The results have been compared with those of the existing
algorithms to show that the proposed algorithms outperform them.

The introductory part of the thesis has been presented in the opening Chapter 1. It
comprises the scope of the thesis, an introduction to the four major clustering models,
the resources used and the organization of the thesis. Chapter 2 is composed of an
extensive review of the existing clustering algorithms of the above four major
clustering models. The various clustering techniques of these models are discussed
along with their merits and demerits.

Our contribution begins with Chapter 3, in which two novel MST-based clustering
algorithms have been presented. The algorithm MST-CV has been designed using the
coefficient of variation; this method also deals with outliers. The algorithm MST-DVI
produces the optimal clusters using the DVI. The proposed algorithms outperformed
several existing methods, namely K-means, SFMST, SC, MSDR, IMST, SAM and MinClue, on
various synthetic and biological data. It is observed that both the proposed algorithms
have quadratic time complexity.

Chapter 4 presented three new clustering algorithms using the Voronoi diagram. The
algorithm GCVD produced the desired clusters by exploiting the Voronoi edges. The
algorithm Voronoi-Cluster obtained optimal clusters by using a function defined with
the help of Voronoi circles. Similarly, the algorithm Build-Voronoi generates efficient
clusters by reconstructing the Voronoi diagram. The proposed algorithms have been shown
to be more efficient than K-means, FCM, CTVN, VDAC and SC. The time complexity of these
algorithms has been shown to be O(n log n).

Two algorithms to enhance K-means clustering have been presented in Chapter 5. The
algorithm KD-JDF has been effective in dealing with the random initialization and
automation problems of the K-means algorithm. Here, the initialization problem has been
handled with the kd-tree, whereas the number of clusters is automated by the JDF. The
algorithm KD-Cluster improves the K-means clustering algorithm to be insensitive to
outliers. The proposed algorithms have outperformed the classical and improved K-means
algorithms. The time complexity of the proposed kd-tree based algorithms is O(n log n).

A prototype-based approach has been designed in Chapter 6 to speed up the DBSCAN
algorithm. The prototypes in this algorithm are produced using the squared-error
clustering technique. The proposed algorithm produced clusters at lower computational
cost than the existing techniques I-DBSCAN, DBCAMM, VDBSCAN and KFWDBSCAN.

In Chapter 7, we have proposed a grid clustering algorithm to compute the grid size
using a novel concept of boundary grids. In this method, we used the local outlier
factor (LOF) to deal with the problem of outliers. The proposed technique is
non-parametric and runs in linear time. The method has produced better results than
K-means, IGDCA and GGCA.

Although the proposed algorithms are efficient in clustering various complex data, they
have a few challenges. Our future efforts will be directed towards the problems
described as follows. We shall attempt to find a solution to the localization problem
of the proposed MST-based clustering algorithm. The proposed Voronoi diagram based
algorithm has the problem of inputting the K value in advance; we shall enhance the
algorithm in connection with this issue. The performance of the kd-tree based K-means
clustering algorithms completely depends on the proper formation of the leaf bucket
sizes. Therefore, we shall try to find the best possible size for the leaf buckets. The
notion of the proposed MDBSCAN clustering algorithm is directed by the neighborhood
parameter ε. We shall make an effort to automate the selection of a proper value of ε.
If the given data has clusters of diverse densities, the proposed concept of boundary
grids in grid-based clustering may not provide the optimal grid size. We shall try to
enhance the grid-based clustering algorithm in our future endeavors to produce clusters
of varied densities.

