7.1 Introduction
Grid-based methods have been widely used over the algorithms discussed in previous chapters because they produce clustering results rapidly. In this chapter, a non-parametric grid-based clustering algorithm is presented using the concepts of boundary grids and the local outlier factor [31].
7.2 Algorithm Based on Local Outlier Factor
Let D = {x1, x2,…, xn} be the given set of n data points embedded in an m-dimensional space. Each xi is composed of m attributes, i.e., xi = (xi1, xi2,…, xim). Initially, a single grid represents all the given n points; the extrema of this grid are taken as the minimum and maximum attribute values in each dimension [129]. Let Min(i) = min{x1i, x2i,…, xni} and Max(i) = max{x1i, x2i,…, xni}. Then the initial grid G(n;m) is represented as follows: G(n;m) = [Min(1), Max(1)] × [Min(2), Max(2)] × … × [Min(m), Max(m)]. Initially, the value of k (the number of clusters) is one. Now, we partition the grid G(n;m) into two equal-volume grids G1(n1;m) and G2(n2;m) along a uniformly (randomly) selected dimension, and the data points of G(n;m) are distributed between these two grids. We define the grids (cells) that contain at least one of the given points as non-empty grids; all other grids remain empty. The empty grids that share a common vertex with non-empty grids are called boundary grids. It can easily be noted that each cluster is surrounded by boundary grids. After each round of partitioning, it is necessary to check for the presence of new clusters. Here, a cluster is defined as the collection of points in the non-empty grids that are connected by common vertices in the grid structure. In the next round of partitioning, the two grids G1(n1;m) and G2(n2;m) are partitioned into four
equal-volume grids G1,1(n3;m), G1,2(n4;m), G2,1(n5;m) and G2,2(n6;m) along another chosen dimension, and the points are distributed among the new grids. In this way, all the grids of the previous iteration are bisected, and the partitioning process continues until a termination condition is met. To identify the optimal grid structure, we use the boundary grids. As discussed above, a cluster is defined as the collection of the points of directly connected non-empty grids, and each cluster is surrounded by boundary grids. We also define the volume of a cluster as the sum of the volumes of all the non-empty grids of that cluster. We observe that the volume of a cluster strictly decreases as the partitioning process continues, while the number of surrounding empty grids (boundary grids) increases significantly. This is clearly depicted in Figures 7.1(a)-(e). If the required clusters have already formed and we continue the partitioning process, some clusters split into multiple sub-clusters, as shown in Figure 7.1(e). Some of these sub-clusters are surrounded by a very small number of boundary grids compared to the clusters of the previous iteration; for instance, five clusters surrounded by very few boundary grids are shown in Figure 7.1(e). This case therefore indicates that the clusters of the previous iteration are the ones we seek, and the corresponding grid size is the optimal one among the grid sizes of all iterations.
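The bisection, point-distribution and boundary-grid behaviour described above can be sketched in a few lines. The integer cell-indexing scheme, the function names and the toy data (two well-separated 2-D blobs) below are our own illustration, not part of the thesis:

```python
import numpy as np
from itertools import product

def assign_cells(points, mins, maxs, splits):
    """Map each point to an integer cell index per dimension.

    splits[d] is the number of times dimension d has been bisected,
    so dimension d is divided into 2**splits[d] equal intervals.
    Returns the set of non-empty cells.
    """
    idx_cols = []
    for d in range(points.shape[1]):
        k = 2 ** splits[d]
        width = (maxs[d] - mins[d]) / k
        idx = np.minimum(((points[:, d] - mins[d]) / width).astype(int), k - 1)
        idx_cols.append(idx)
    return set(map(tuple, np.stack(idx_cols, axis=1)))

def boundary_cells(nonempty, splits):
    """Empty cells sharing at least one vertex with a non-empty cell."""
    m = len(splits)
    boundary = set()
    for cell in nonempty:
        for off in product((-1, 0, 1), repeat=m):
            nb = tuple(cell[d] + off[d] for d in range(m))
            if nb not in nonempty and all(0 <= nb[d] < 2 ** splits[d] for d in range(m)):
                boundary.add(nb)
    return boundary

# Two well-separated blobs: boundary grids multiply as the cells shrink.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
mins, maxs = pts.min(axis=0), pts.max(axis=0)

splits = [0, 0]
history = []
for r in range(6):
    splits[r % 2] += 1          # bisect one dimension per round
    ne = assign_cells(pts, mins, maxs, splits)
    history.append((len(ne), len(boundary_cells(ne, splits))))
```

Running this shows the trend the chapter relies on: the number of non-empty cells stays bounded by n while the number of boundary grids grows as the partitioning deepens.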
Figure 7.1: Clusters produced by grouping the points in the grids with common vertices: (a) one cluster; (b) five clusters; (c) five clusters; (d) more than five clusters; (e) a cluster partitioned into several sub-clusters with fewer boundary grids.
7.2.1 Problem of Outliers
The above idea is sensitive to outlier points in the given data set. As outliers lie far from the clusters, these points form separate clusters during the partitioning process: the outlier region becomes separated from the other clusters as the number of grids increases (i.e., as the grid size decreases). As per our definition of a cluster, the outlier points are also surrounded by boundary grids, and their number is significantly smaller than that of the other clusters, which falsely signals the termination stage as per the above discussion. The clusters may therefore not be fully formed, because the partitioning process is left incomplete due to the outliers. Hence, we need a measure for identifying outliers so that the partitioning process can continue smoothly. For this purpose, we use the local outlier factor (LOF) proposed by Breunig et al. [31] to compute the degree to which a point is an outlier.
Let S denote the given set of data points and d(x, y) the distance between two points x and y. For a point x ∈ S, let N(x, mp) = {y ∈ S \ {x} : d(x, y) ≤ d(x, xmp)}, where xmp is the mp-th nearest neighbor of x. Thus N(x, mp) contains at least mp points. The density of a point x ∈ S is defined as follows.

\[ \text{density}(x, mp) = \frac{|N(x, mp)|}{\sum_{y \in N(x, mp)} d(x, y)} \tag{7.1} \]

If the distances between x and its neighboring points are small, then the density of x is high. The average relative density (ard) of x is then calculated as below.

\[ \text{ard}(x, mp) = \frac{\text{density}(x, mp)}{\dfrac{1}{|N(x, mp)|} \sum_{y \in N(x, mp)} \text{density}(y, mp)} \tag{7.2} \]
Now, the local outlier factor (LOF) of x is defined as the inverse of the average relative density of x, i.e.,

\[ \mathrm{LOF}(x, mp) = \frac{1}{\text{ard}(x, mp)} \tag{7.3} \]
If a point belongs to a cluster, then its LOF value is close to one, because the density of that point and the densities of its neighboring points are roughly equal.
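The LOF of Eqs. (7.1)-(7.3) can be sketched directly. Note that this follows the simplified sum-of-distances density used in this chapter, not the reachability-distance formulation of Breunig et al.; the function name and the toy data are our own:

```python
import numpy as np

def lof(X, mp):
    """LOF per Eqs. (7.1)-(7.3): the inverse of the average relative density."""
    # Pairwise Euclidean distances between all points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    # Indices of the mp nearest neighbours of each point (self excluded).
    nbrs = np.argsort(D, axis=1)[:, 1:mp + 1]
    # Eq. (7.1): density(x, mp) = |N(x, mp)| / sum of distances to the neighbours.
    dens = mp / np.array([D[i, nbrs[i]].sum() for i in range(n)])
    # Eq. (7.2): density of x relative to the mean density of its neighbours.
    ard = dens / np.array([dens[nbrs[i]].mean() for i in range(n)])
    return 1.0 / ard    # Eq. (7.3)

# A tight cluster plus one far-away point: the outlier's LOF is far above one,
# while the LOF of the cluster points stays close to one.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), [[10.0, 10.0]]])
scores = lof(X, mp=5)
```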
If any cluster has significantly fewer boundary grids in some iteration of the partitioning, we compute the LOF value of every point of that cluster using the above definition. If any point is an outlier, we continue the partitioning process; otherwise, the process is terminated. The pseudocode of the algorithm is provided below.
Step 1: Call Grid(n; m);
Step 2: p ← 1;
Step 3: Call Partition(G) to find the equal-volume grids G1 and G2.
Step 4: Call Connected() to produce the clusters (say, C1, C2,…, Cl for some l) by
        grouping points in the grids which are connected by common vertices in
        the grid structure.
Step 5: Find the number of boundary grids BGj of each cluster C1, C2,…, Cl;
Step 6: If (p > 1) then
            If the number of boundary grids of any cluster C of t points is
            less than 20% of the boundary grids of any cluster in the (p−1)th
            iteration then
                Go to step 7.
            Else
            {
                p ← p + 1;
                Go to step 3 to continue the partitioning for all the grids.
            }
        Else
        {
            p ← p + 1;
            Go to step 3 to continue the partitioning for all the grids.
        }
Step 7: Call LOF(xq) ∀ xq ∈ C, 1 ≤ q ≤ t;
Step 8: If LOF(xq) ≈ 1 ∀ xq ∈ C then Go to step 9;
        Else /* sign of outliers */
        {
            p ← p + 1;
            Go to step 3 to continue the partitioning process for all the grids.
        }
Step 9: Output the clusters C1, C2,…, Ck with respect to the grid size of the (p−1)th
        iteration.
Step 10: Exit();
□
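The stopping test of Steps 6-8 can be condensed into two small helpers. The interpretation of the 20% rule (relative to the smallest boundary-grid count of the previous iteration) and the tolerance `tol` for "LOF ≈ 1" are our reading of the pseudocode; the thesis does not pin down these details:

```python
def suspicious_clusters(prev_counts, curr_counts, ratio=0.2):
    """Step 6: indices of clusters whose boundary-grid count has fallen
    below `ratio` times the smallest count of the previous iteration."""
    threshold = ratio * min(prev_counts)
    return [j for j, c in enumerate(curr_counts) if c < threshold]

def should_stop(suspects, lof_by_cluster, tol=0.25):
    """Step 8: stop only if every point of every suspect cluster has
    an LOF value approximately equal to one (no outliers present)."""
    return all(abs(s - 1.0) <= tol
               for j in suspects
               for s in lof_by_cluster[j])

prev = [50, 60]        # boundary-grid counts in iteration p-1
curr = [48, 55, 8]     # a new, suspiciously small cluster has appeared
suspects = suspicious_clusters(prev, curr)   # -> [2]
```

With `suspects = [2]`, the algorithm stops if all LOF values of cluster 2 are close to one; if any LOF value is large (a sign of outliers), partitioning continues.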
Function Grid(n; m)
{
    Min(y) = min{x1y, x2y,…, xny}, 1 ≤ y ≤ m;
    Max(y) = max{x1y, x2y,…, xny}, 1 ≤ y ≤ m;
    G(n;m) = [Min(1), Max(1)] × [Min(2), Max(2)] × … × [Min(m), Max(m)],
    i.e., \( G(n;m) = \prod_{1 \le y \le m} [\mathrm{Min}(y), \mathrm{Max}(y)] \);
}
Function Connected()
{
Step 1: l ← 1;
Step 2: Start with any random grid G.
Step 3: If G is non-visited, then mark it as visited and add all the points of G
        to Cl, i.e., Cl ← Cl ∪ {points in the grid G};
        Else Go to step 2.
Step 4: Find the non-visited grids Gr (for some r) that share a common vertex
        with G.
Step 5: Add the points of all the grids Gr to the cluster Cl and mark each Gr as
        visited, i.e., Cl ← Cl ∪ {points in the grids Gr};
Step 6: Repeat steps 4 and 5 for all the grids Gr until no new grid is identified.
Step 7: If \( \sum_j |C_j| < n \) then
            l ← l + 1;
            Go to step 2 to restart the process.
        Else
            Return (C1, C2,…, Cj);
}
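The Connected() routine is a connected-components search over the non-empty cells, where two cells are adjacent when they share at least one vertex, i.e. their integer indices differ by at most 1 in every dimension. A minimal sketch (the representation of cells as integer index tuples is our own choice):

```python
from collections import deque
from itertools import product

def connected_clusters(nonempty_cells, m):
    """Group non-empty grid cells into clusters: two cells belong to the
    same cluster when they share a vertex (Steps 2-7 of Connected)."""
    remaining = set(nonempty_cells)
    clusters = []
    while remaining:
        seed = remaining.pop()               # start a new cluster C_l
        component, queue = {seed}, deque([seed])
        while queue:
            cell = queue.popleft()
            # Visit every cell whose index differs by at most 1 per dimension.
            for off in product((-1, 0, 1), repeat=m):
                nb = tuple(cell[d] + off[d] for d in range(m))
                if nb in remaining:
                    remaining.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        clusters.append(component)
    return clusters

# Two vertex-connected groups: {(0,0),(1,1)} touch diagonally, {(5,5),(5,6)} side by side.
cells = {(0, 0), (1, 1), (5, 5), (5, 6)}
groups = connected_clusters(cells, m=2)      # -> two clusters
```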
7.2.2 Experimental Analysis
Time Complexity: The proposed algorithm constructs the grid structure of the given data points a finite number of times (say p) along the uniformly chosen dimensions. This task requires O(pn) time, as the partitioning process in each iteration requires linear time. The local outlier factor (LOF) is computed only for the points (if any) of a cluster suspected to represent outliers, and for non-outlier points only in the last iteration of the algorithm. Hence, the LOF is computed only a small number of times (say h) relative to the number of data points n, which requires O(hn) time. Since both p and h are very small compared to n, the overall time complexity of the proposed algorithm is effectively linear in n.
Cluster Quality: Let cl be the number of points of the data set that belong to the class l, for l = 1, 2,…, L, so that \( \sum_{l=1}^{L} c_l = n \). Then the Total Entropy (ENTotal), which is the average information per point in the data set, is defined as follows.

\[ EN_{Total} = -\sum_{l=1}^{L} \frac{c_l}{n} \log_2 \frac{c_l}{n} \tag{7.4} \]
Assume that the total number of clusters is K and that the kth cluster contains nk points. Let ckl be the number of points of the kth cluster that belong to class l, where l = 1, 2,…, L. Then the kth cluster entropy (ENk) is given by

\[ EN_k = -\sum_{l=1}^{L} \frac{c_{kl}}{n_k} \log_2 \frac{c_{kl}}{n_k} \tag{7.5} \]
It is clear that if a cluster contains points of only one class, then ENk is zero. Now, based on the cluster entropies, the weighted entropy (wEN), which is the average information per point over all clusters, is calculated as follows.
\[ wEN = \sum_{k=1}^{K} \frac{n_k}{n}\, EN_k \tag{7.6} \]
If all the clusters are homogeneous, i.e., each contains points of a single class, then wEN is zero. The Normalized Information Gain (NIG) is calculated from the above quantities as follows.
\[ NIG = \frac{EN_{Total} - wEN}{EN_{Total}} \tag{7.7} \]
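Equations (7.4)-(7.7) can be computed directly from the class labels and cluster assignments; a minimal sketch (the function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels (Eqs. (7.4)/(7.5))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def nig(true_labels, cluster_ids):
    """Normalized information gain, Eqs. (7.4)-(7.7)."""
    n = len(true_labels)
    en_total = entropy(true_labels)                  # Eq. (7.4)
    members = {}
    for lab, cid in zip(true_labels, cluster_ids):
        members.setdefault(cid, []).append(lab)
    # Eq. (7.6): weighted average of the per-cluster entropies (Eq. (7.5)).
    w_en = sum(len(ms) / n * entropy(ms) for ms in members.values())
    return (en_total - w_en) / en_total              # Eq. (7.7)
```

A perfect clustering of two balanced classes, `nig([0, 0, 1, 1], [0, 0, 1, 1])`, gives 1.0 (homogeneous clusters, wEN = 0), while the fully mixed assignment `nig([0, 0, 1, 1], [0, 1, 0, 1])` gives 0.0.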
Initially, the proposed method, K-means, IGDCA and GGCA were applied to eight two-dimensional synthetic data sets of various shapes and densities. The comparison results in Table 7.1 below demonstrate the efficiency of the proposed method over the existing ones: the NIG values of the proposed method are larger than those of K-means, IGDCA and GGCA. Hence, by the definition of NIG, the proposed method outperforms the existing methods.
Table 7.1: Results of synthetic data using normalized information gain (NIG)
Table 7.2: Results of biological data using normalized information gain (NIG)
Breast Tissue 9 106 2 0.82 0.78 0.91 0.90
7.3 Conclusion
In this chapter, a new approach has been proposed to find the optimal grid size using the cluster boundaries. The proposed method is non-parametric and runs with linear computational complexity. It is made insensitive to outlier points by exploiting the local outlier factor (LOF). The proposed scheme has been evaluated on various synthetic and biological data sets using the normalized information gain (NIG), and it has been shown that the proposed method produces better results than K-means, IGDCA and GGCA.
CHAPTER 8
SUMMARY AND CONCLUSIONS
In this thesis, we have presented various clustering algorithms belonging to four basic models of clustering, namely, hierarchical, partitional, density-based and grid-based. We have addressed some of the problems associated with the existing algorithms and tried to resolve them through new or improved algorithms. We have shown the experimental results of all the proposed algorithms on several benchmark synthetic and biological data sets, and the results have been compared with those of the existing algorithms to demonstrate the superiority of the proposed algorithms.
The introductory part of the thesis is presented in Chapter 1. It comprises the scope of the thesis, an introduction to the four major clustering models, the resources used, and the organization of the thesis. Chapter 2 contains an extensive review of the existing clustering algorithms of the above four clustering models; the various clustering techniques of these models are discussed along with their merits and demerits.
efficient clusters by reconstructing the Voronoi diagram. The proposed algorithms have been shown to be more efficient than K-means, FCM, CTVN, VDAC and SC, and the time complexity of both of these algorithms has been shown to be O(n log n).
formation of the leaf bucket sizes. Therefore, we shall try to find the best possible size for the leaf buckets. The behaviour of the proposed MDBSCAN clustering algorithm is directed by the neighborhood parameter ε, and we shall make an effort to automate the selection of a proper value of ε. If the given data contains clusters of diverse densities, the proposed concept of boundary grids in grid-based clustering may not provide the optimal grid size. We shall try to enhance the grid-based clustering algorithm in our future work to produce clusters of varied densities.