
A comparative study of various algorithms to detect clustering in spatial data

A Graduate Project Report submitted to Manipal Academy of Higher Education


in partial fulfilment of the requirement for the award of the degree of

BACHELOR OF TECHNOLOGY
in
Electronics and Communication Engineering

Submitted by
WOONA HANISH
Reg. No: 160907316
Under the guidance of

EXTERNAL GUIDE                               INTERNAL GUIDE
Amitha Puranik                               Vishnumurthy Kedlaya K
Assistant Professor                          Department of Electronics &
Department of Data Science,                  Communication
Prasanna School of Public Health,
MAHE, Manipal

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

MANIPAL-576104, KARNATAKA, INDIA

MAY/JUNE 2020
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

MANIPAL-576104, KARNATAKA, INDIA

Manipal
31-05-2020

CERTIFICATE

This is to certify that the project titled "A comparative study of various algorithms
to detect clustering in spatial data" is a record of the bonafide work done by HANISH
WOONA (Reg. No. 160907316), submitted in partial fulfilment of the requirements for the
award of the Degree of Bachelor of Technology (BTech) in ELECTRONICS AND
COMMUNICATION ENGINEERING of Manipal Institute of Technology, Manipal,
Karnataka (A Constituent Unit of Manipal Academy of Higher Education), during the
academic year 2019-2020.

Vishnumurthy Kedlaya K                       Prof. Dr. M. Sathish Kumar

Asst. Prof., ECE                             HOD, ECE
M.I.T., Manipal                              M.I.T., Manipal
ACKNOWLEDGMENTS

This project was guided by Amitha Puranik, Department of Data Science, Prasanna School of
Public Health, MAHE, Manipal, and closely monitored by Vishnumurthy Kedlaya K,
Department of Electronics and Communication Engineering, M.I.T., Manipal. The data used
for this project was provided by Prasanna School of Public Health, MAHE, Manipal.
ABSTRACT

Spatial clustering can be defined as the process of grouping objects with certain dimensions
into groups such that objects within a group exhibit similar characteristics compared to those
in other groups. It is an important part of spatial data mining since it provides certain
insights into the distribution of data and the characteristics of spatial clusters.

Data pre-processing is done manually using Excel sheets to match the names of the districts
with the shapefile. The shapefile of the districts of India is used to plot the data onto a
map; we use software called ArcGIS to manipulate the shapefile. Later, this shapefile is
used for the clustering analysis. All the results are compared at the end to determine the
pros and cons of each algorithm in various scenarios.
LIST OF TABLES

Table No Table Title Page No


1.1 Project schedule 10
2.1 Literature Review 11
4.1 Comparison of results for aggregate data taking LISA as standard 34
LIST OF FIGURES

Figure No Figure Title Page No


2.1 K vs sum of squares 12
2.2 K-means flow chart 13
2.3 K-Medoids example 14
2.4 AGNES vs DIANA 14
2.5 DBSCAN 15
2.6 CLIQUE clustering 16
2.7 K-means 17
2.8 FUZZY clustering 18
3.1 Methodology flow chart 21
4.1 K value 24
4.2 K-means scatter plot 25
4.3 K-means clustering 25
4.4 K-medoids scatter plot 26
4.5 K-medoids results 26
4.6 AGNES dendrogram 27
4.7 AGNES scatter plot 27
4.8 AGNES result 28
4.9 DBSCAN result 29
4.10 CLIQUE lon vs lat 30
4.11 CLIQUE lon vs rape 31
4.12 CLIQUE lat vs rape 32
4.13 CLIQUE result 32
4.14 LISA result 33
Contents
Page No
Acknowledgement 3
Abstract 4
List Of Figures 5
List Of Tables 6

Chapter 1 INTRODUCTION 8
1.1 Introduction 8
1.2 Present day scenario 8
1.3 Motivation to do the work 9
1.4 Objective of the work 9
1.5 Target Specifications 9
1.6 Project schedule

Chapter 2 BACKGROUND THEORY 10
2.1 Introduction 10
2.2 Introduction to project title 10
2.3 Literature Review 10
2.4 Background theory 11

Chapter 3 METHODOLOGY 21
3.1 Introduction 21
3.2 Methodology 21

Chapter 4 RESULT ANALYSIS 25
4.1 Introduction 25
4.2 Results 25

Chapter 5 CONCLUSION AND FUTURE SCOPE 37

REFERENCES 39
PROJECT DETAILS 40
CHAPTER 1
INTRODUCTION

This chapter introduces the area of the work, briefly describes the present-day scenario
in that area, and discusses the motivation for doing the project. Both the main and
secondary objectives of the work are specified, and the project work schedule is given.

1.1 Introduction to the area of the work:

Spatial data, also known as geospatial data or geographic information, is the data or
information that identifies the geographic location of features and boundaries on
earth, such as natural or constructed features, oceans, and more. Spatial data is usually
stored as coordinates and topology and is data that can be mapped.

Cluster analysis is the process of partitioning a set of data objects (or observations)
into subsets. Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters. The set of clusters resulting from a
cluster analysis can be referred to as a clustering. In this context, different clustering
methods may generate different clusterings on the same data set.

The partitioning is done by clustering algorithms. Hence, clustering is useful for
discovering previously unknown groups within the data. It is an important part of
spatial data mining since it provides certain insights into the distribution of data and
the characteristics of spatial clusters.

1.2 Present day scenario:

Local Indicators of Spatial Association (LISA) is one of the most widely used techniques
in spatial clustering.
In this project we apply various machine learning clustering techniques to spatial data
and compare the results with those of the LISA technique for aggregate data and of the
fuzzy algorithm for point data.

1.3 Motivation to do the work:

The detection of spatial clusters is important in public health decision making, both to
allocate resources for preventive health measures and to make environmental control
decisions. Comparing the various cluster detection techniques provides insight into
clustering quality and execution time, and lets us decide which clustering technique to
use depending on the kind of data we have.

1.4 Objective of the work:

To compare the K-means, K-medoids, AGNES, DBSCAN, DENCLUE, CLIQUE and FUZZY
clustering algorithms in identifying hotspots, in terms of four factors: time
complexity, inputs, handling of higher dimensions and handling of irregularly shaped
clusters.

1.5 Target Specifications:

Analysing the results of different clustering methods and finding the best method
according to the data available to us.
1.6 Project schedule:

Table 1.1 Project schedule


CHAPTER 2
BACKGROUND THEORY

2.1 Introduction:

In this chapter we discuss the project title, the literature review and its summarized
outcome, general analysis, mathematical derivations and conclusions.

2.2 Introduction to project title:

The K-means, K-medoids, AGNES, DBSCAN, DENCLUE, CLIQUE, STING and LISA
clustering algorithms are used to detect clustering in spatial data, and the results are
compared to find the best algorithm for a given type of data.

2.3 Literature Review:


TITLE: Crime Prediction and Forecasting in Tamilnadu using Clustering Approaches
AUTHOR'S NAME/SOURCE: S. Sivaranjani, Dr. S. Sivakumari, Aasha M.
RESEARCH FINDINGS: Implementation of K-means, AGNES and DBSCAN, and comparison of their performances

TITLE: Review of Spatial Clustering Methods
AUTHOR'S NAME/SOURCE: Neethu C V, Subu Surendran
RESEARCH FINDINGS: Theoretical comparison of spatial clustering methods

TITLE: DENCLUE-IM: A New Approach for Big Data Clustering
AUTHOR'S NAME/SOURCE: Hajar REHIOUI, Abdellah IDRISSI, Manar ABOUREZQ, Faouzia ZEGRARI
RESEARCH FINDINGS: Type of data required for this algorithm

Table 2.1: Literature Review


2.4 Background theory:
K Means algorithm:

1. To perform this algorithm we start by selecting k locations at random as the initial
centroids.
2. We then form clusters by allotting each observation to the closest centroid based on
Euclidean distance.
3. To select the value of k, we compute the sum of squared distances between each
observation and its centroid (the within-cluster sum of squares).
4. We plot a graph with k on the x axis and the sum of squares on the y axis, for k
values starting from 2 up to 10-20.

Figure 2.1: K vs sum of squares


5. The graph looks similar to an exponentially decreasing curve. We choose the k value at
which the change in the sum of squares becomes insignificant (the "elbow"); a code sketch
follows Figure 2.2 below.
Figure 2.2: K-means flow chart
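As a concrete illustration of steps 1-5, here is a minimal, hedged sketch of the elbow
method using scikit-learn; the matrix X is a random placeholder, not the project's actual
district dataset.

```python
# Minimal elbow-method sketch (assumes scikit-learn); X is placeholder data.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))  # stand-in for (lon, lat) points

inertias = []
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares for this k

# Choose the k where the drop in the sum of squares flattens out (the "elbow").
for k, wss in zip(range(2, 15), inertias):
    print(k, round(wss, 3))
```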

K Medoids algorithm:

1. The K-medoids algorithm is developed from the K-means algorithm to eliminate the
drawback of not having an observation point at the centroid.
2. We follow the same steps we followed for the k-means algorithm, but we use k actual
observation points as the cluster centres (medoids); see the sketch after Figure 2.3.
Figure 2.3: K-Medoids example
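The following is a toy k-medoids sketch in plain NumPy (a PAM-style alternation); for real
work a tested library implementation (e.g. scikit-learn-extra's KMedoids) is preferable,
and the data here is a random placeholder.

```python
# Toy k-medoids sketch: alternate between assignment and medoid update.
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)         # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # the member with the smallest total distance to its cluster becomes the medoid
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

X = np.random.default_rng(1).random((60, 2))
medoids, labels = k_medoids(X, k=4)  # k = 4, as chosen by the elbow plot above
```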

AGNES algorithm:

1. Hierarchical clustering is a method of cluster analysis which seeks to build a


hierarchy of clusters. Strategies for hierarchical clustering generally fall into two
types:
2. Agglomerative: This is a bottom-up approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
3. Divisive: This is a top-down approach: all observations start in one cluster, and splits
are performed recursively as one moves down the hierarchy.

Figure 2.4: AGNES vs DIANA


4. The AGNES algorithm works by merging the data points one by one on the basis of the
nearest of all the pairwise distances between them; after each merge, the distances
between the resulting clusters are recalculated. A minimal sketch follows below.
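A hedged AGNES-style sketch using SciPy's hierarchical clustering; the data and the choice
of average linkage are illustrative assumptions, not the report's exact settings.

```python
# Agglomerative (AGNES-style) clustering with SciPy on placeholder data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).random((30, 2))      # placeholder coordinates
Z = linkage(X, method="average")                  # bottom-up (agglomerative) merges
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the merge tree into 4 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw a dendrogram like Figure 4.6.
```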

DBSCAN algorithm:

1. Take a point and, with epsilon as radius, draw a circle around it.
2. If the number of points inside the circle is greater than or equal to MinPts, then the
point is considered a core point.
3. If a point does not satisfy the MinPts condition but has at least one core point inside
its circle, then it becomes a border point.
4. If both of the above conditions fail, the point becomes a noise point.
5. Only core and border points are used to form a cluster; noise points are never
taken into consideration (see the sketch after Figure 2.5).

Figure 2.5: DBSCAN
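A hedged DBSCAN sketch with scikit-learn; the eps and min_samples values below are
illustrative assumptions, not the parameters tuned in this project.

```python
# DBSCAN on placeholder data; -1 labels mark noise points.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(3).random((200, 2))
db = DBSCAN(eps=0.08, min_samples=5).fit(X)   # epsilon radius and MinPts threshold
labels = db.labels_                           # cluster ids; -1 marks noise points
core_idx = db.core_sample_indices_            # indices of the core points
print(set(labels))
```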

CLIQUE algorithm:

1. Partition the data space and find the number of points that lie inside each cell of the
partition.
2. Identify the subspaces that contain clusters using the Apriori principle.
3. Identify clusters:
a. Determine dense units in all subspaces of interest.
b. Determine connected dense units in all subspaces of interest.
4. Generate a minimal description for the clusters:
a. Determine the maximal regions that cover a cluster of connected dense units for
each cluster.
b. Determine the minimal cover for each cluster.
(A toy sketch of step 1 follows Figure 2.6 below.)

Figure 2.6: CLIQUE clustering
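The following toy sketch shows only CLIQUE's grid step (partitioning and dense-unit
detection) in 2-D; the grid resolution xi and threshold tau are assumed values, and the
Apriori subspace search (step 2) and cover generation (steps 3-4) are omitted.

```python
# Toy sketch of CLIQUE's grid step: partition space, count points, flag dense cells.
import numpy as np

X = np.random.default_rng(4).random((300, 2))
xi = 10    # intervals per dimension (assumed grid resolution)
tau = 8    # density threshold: a cell with >= tau points is "dense"

cells = np.minimum((X * xi).astype(int), xi - 1)   # grid cell index of each point
counts = np.zeros((xi, xi), dtype=int)
np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)   # step 1: points per cell
dense = counts >= tau                              # dense units in the full space
print(dense.sum(), "dense cells")
# Step 3 would connect adjacent dense cells (e.g. via flood fill) into clusters.
```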


FUZZY algorithm:

1. Fuzzy clustering is an extension of k-means; k-means is one approach among the
partitioning methods.
2. In fuzzy clustering each data point can belong to more than one cluster; each data
point has a degree of membership of belonging to each cluster. The main advantage of
fuzzy clustering is that the fuzzy approach yields much more detailed information on
the structure of the data.
3. Formally, given a set of objects, o1, o2, ..., on, a fuzzy clustering of k fuzzy
clusters, C1, C2, ..., Ck, can be represented using a partition matrix, M = [wij]
(1 ≤ i ≤ n, 1 ≤ j ≤ k), where wij is the membership degree of oi in fuzzy cluster Cj.
The partition matrix should satisfy the following three requirements:
 For each object, oi, and cluster, Cj, 0 ≤ wij ≤ 1. This requirement enforces
that a fuzzy cluster is a fuzzy set.

 For each object, oi, Σj=1..k wij = 1. This requirement ensures that every object
participates in the clustering equivalently.

 For each cluster, Cj, 0 < Σi=1..n wij ≤ n. This requirement ensures that for every
cluster, there is at least one object for which the membership value is nonzero.
Figure 2.7: K-means

Figure 2.8: FUZZY clustering
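A compact fuzzy c-means sketch in NumPy for illustration; libraries such as scikit-fuzzy
provide production versions, the data is a placeholder, and m > 1 is the fuzzifier.

```python
# Fuzzy c-means: alternate center updates and membership updates on matrix W.
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((len(X), c))
    W /= W.sum(axis=1, keepdims=True)          # rows sum to 1 (requirement 2 above)
    for _ in range(iters):
        Wm = W ** m
        centers = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1) + 1e-12
        W = 1.0 / (d ** (2 / (m - 1)))         # standard FCM membership update
        W /= W.sum(axis=1, keepdims=True)      # renormalize memberships
    return centers, W

X = np.random.default_rng(5).random((100, 2))
centers, W = fuzzy_cmeans(X, c=3)   # W[i, j] = membership of point i in cluster j
```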

LISA algorithm:

Assigning data points to zones:

If a user has information on the location of individual events, then it is better to utilize
that information with the point statistics. The individual-level information will contain
all the uniqueness of the events.

However, sometimes it is not possible to analyze data at the individual level. The
user may need to aggregate the individual data points to spatial areas in order to
compare the events to data that are only obtained for zones, such as census data, or to
model environmental correlates of the data points or may find that individual data are
not available. In this case, the individual data points are allocated to zones by, first,
spatially assigning them to the zones in which they fall and, second, counting the
number of points assigned to each zone. A user can do this with a GIS program or
with the “Assign Primary points to Secondary Points” routine.
In this case, the zone becomes the unit of analysis instead of the individual data
points. All the incidents are assigned to a single geographical coordinate, typically
the centroid of the zone, and the number of incidents in the zone becomes an attribute
of the zone.

Thus, the distance between zones is a singular value for all the points in those zones
whereas there is much greater variability with the distances between individual
events.
Further, zones have attributes which are properties of the zone, not of the individual
events. The attribute can be a count or a continuous variable for a distributional
property of the zone.

Moran’s “I” Statistic:

Moran’s “I” statistic is one of the oldest indicators of spatial autocorrelation. It is


applied to zones or points that have attribute variables associated with them. For any
continuous variable, Xi, a mean, can be calculated and the deviation of any one
observation from that mean can also be calculated. The statistic then compares the
value of the variable at any one location with the value at all other locations.
Formally, it is defined as:

I = [N / (Σi Σj Wij)] × [Σi Σj Wij (Xi − X̄)(Xj − X̄)] / [Σi (Xi − X̄)²]

where N is the number of cases, Xi is the value of a variable at a particular location,
i, Xj is the value of the same variable at another location (where i ≠ j), X̄ is the mean
of the variable and Wij is a weight applied to the comparison between location i and
location j.

In Moran’s initial formulation, the weight variable, Wij, was a contiguity matrix. If
zone j is adjacent to zone i, the interaction receives a weight of 1. Otherwise, the
interaction receives a weight of 0. Cliff and Ord (1973) generalized these definitions
to include any type of weight. In more current use, Wij is a distance-based weight,
the inverse distance between locations i and j (1/dij). CrimeStat uses this
interpretation. Essentially, it is a weighted Moran's I where the weight is an inverse
distance.

Unlike a correlation coefficient, the theoretical value of the index does not equal 0 for
lack of spatial dependence, but instead is negative and very close to 0:

E(I) = −1 / (N − 1)

Values of "I" above the theoretical mean, E(I), indicate positive spatial
autocorrelation while values of "I" below the theoretical mean indicate negative
spatial autocorrelation.

Adjustment for small distances:

CrimeStat calculates the weighted Moran's I using the equation above. However,
there is one problem with this formula that can lead to unreliable results. The distance
weight between two locations, Wij, is defined as the reciprocal of the distance
between the two points, consistent with Moran's original formulation:

Wij = 1 / dij

As dij becomes small, then Wij becomes very large, approaching infinity as the
distance between the points approaches 0. If the two zones were next to each other,
which would be true for two adjacent blocks for example, then the pair of
observations would have a very high weight, sufficient to distort the “I” value for the
entire sample. Further, there is a scale problem that alters the value of the weight. If
the zones are police precincts, for example, then the minimum distance between
precincts will be a lot larger than the minimum distance between smaller
geographical units, such as blocks. These differences in scale need to be taken into
account.

CrimeStat includes an adjustment for small distances so that the maximum weight can
never be greater than 1.0. The adjustment scales distances to one mile, which is a
typical distance unit in the measurement of crime incidents. When the small distance
adjustment is turned on, the minimal distance is automatically scaled to one mile, so
that the inverse-distance weight Wij never exceeds 1.0.
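To make the computation concrete, here is a hedged sketch of Moran's I with
inverse-distance weights in plain NumPy; spatial libraries such as PySAL provide tested
implementations, and the coordinates and values below are random placeholders.

```python
# Moran's I with Wij = 1/dij weights (zero weight on the diagonal).
import numpy as np

def morans_i(coords, x):
    n = len(x)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    with np.errstate(divide="ignore"):
        W = np.where(d > 0, 1.0 / d, 0.0)      # inverse-distance weight matrix
    z = x - x.mean()                           # deviations from the mean
    num = (W * np.outer(z, z)).sum()
    return (n / W.sum()) * num / (z ** 2).sum()

rng = np.random.default_rng(6)
coords = rng.random((50, 2))
x = rng.random(50)
print(morans_i(coords, x))   # compare against the theoretical mean E(I) = -1/(n-1)
```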
CHAPTER 3
METHODOLOGY

3.1 Introduction:

In this chapter detailed methodology and Tools used will be discussed.

3.2 Methodology:

Step I:   Data pre-processing
Step II:  Mapping the data into the shapefile
Step III: Performing clustering analysis
Step IV:  Comparing the results based on the clustering parameters

Figure 3.1: Methodology flow chart

1. Data Pre-processing:

We are using crime data on rapes of females for the year 2013. The raw data
must have the same number of objects, with the same object names, as the shapefile in
order to be mapped onto it.

This involves some manual cross-checking of object names (district names). For some
districts the data was split into sub-district records; all such sub-district records had
to be re-joined into a single district record.
2. Mapping the data into shape file:

The shapefile is a geospatial vector data format for geographic information
system (GIS) software.

We use two component files of the shapefile in this project:

.shp – holds the geospatial geometry used for visualization
.dbf – holds the attribute data of each object in tabular form

The data from the Excel sheet is mapped onto the Indian districts shapefile.

We use software called ArcGIS for mapping data into the shapefile.

Mapping the data into a shapefile is an important step; the mapped shapefile is later
used to perform the clustering analysis.
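For readers without ArcGIS, the same mapping step can be sketched with GeoPandas; the
file names and the "DISTRICT" join column below are hypothetical placeholders for the
project's actual files.

```python
# Hedged sketch of the shapefile-mapping step using GeoPandas in place of ArcGIS.
import geopandas as gpd
import pandas as pd

districts = gpd.read_file("india_districts.shp")   # reads the .shp/.dbf pair together
crime = pd.read_csv("rape_2013.csv")               # the pre-processed Excel export
merged = districts.merge(crime, on="DISTRICT", how="left")  # match district names
merged.plot(column="rape_2013", legend=True)       # choropleth of the attribute
```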

3. Performing clustering algorithms

4. Comparing the results based on the clustering parameters:

Jaccard Index:
The Jaccard similarity index (sometimes called the Jaccard similarity coefficient)
compares members for two sets to see which members are shared and which are
distinct. It’s a measure of similarity for the two sets of data, with a range from 0% to
100%. The higher the percentage, the more similar the two populations.

The formula to find the Index is:


Jaccard Index = (the number in both sets) / (the number in either set) × 100

This percentage tells you how similar the two sets are. Two sets that share all
members would be 100% similar; the closer to 100%, the more similar the sets.

Jaccard Distance:
A similar statistic, the Jaccard distance, is a measure of how dissimilar two sets are. It
is the complement of the Jaccard index and can be found by subtracting the Jaccard
index from 100% (equivalently, from 1 when expressed as a proportion):
D(X,Y) = 1 − J(X,Y)
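A small sketch of the Jaccard index and distance on two sets of flagged districts; the
district labels are invented purely for illustration.

```python
# Jaccard index/distance between two sets of "hotspot" districts (labels invented).
hotspots_a = {"d01", "d02", "d03", "d04"}   # e.g. districts flagged by K-means
hotspots_b = {"d01", "d03", "d05"}          # e.g. districts flagged by LISA

jaccard = len(hotspots_a & hotspots_b) / len(hotspots_a | hotspots_b) * 100
print(f"Jaccard index: {jaccard:.1f}%, Jaccard distance: {100 - jaccard:.1f}%")
```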
Rand Index:
The Rand index or Rand measure in statistics, and in particular in data clustering, is a
measure of the similarity between two data clusterings. A form of the Rand index that
is adjusted for the chance grouping of elements is the adjusted Rand index. From a
mathematical standpoint, the Rand index is related to accuracy, but it is applicable
even when class labels are not used.

Given a set of n elements S = {o1, ..., on} and two partitions of S to compare,
X = {X1, ..., Xr}, a partition of S into r subsets, and Y = {Y1, ..., Ys}, a partition of
S into s subsets, define the following:

 a, the number of pairs of elements in S that are in the same subset in X and in the same
subset in Y.

 b, the number of pairs of elements in S that are in different subsets in X and in
different subsets in Y.

 c, the number of pairs of elements in S that are in the same subset in X and in
different subsets in Y.

 d, the number of pairs of elements in S that are in different subsets in X and in the
same subset in Y.

The Rand index R is:

R = (a + b) / (a + b + c + d)

a + b can be considered as the number of agreements between X and Y, and c + d as the
number of disagreements between X and Y.
Since the denominator is the total number of pairs, the Rand index represents the
frequency of occurrence of agreements over the total pairs, or the probability that X
and Y will agree on a randomly chosen pair.

Similarly, one can also view the Rand index as a measure of the percentage of correct
decisions made by the algorithm. It can be computed using the following formula:

R = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.
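A hedged pairwise Rand-index sketch follows; scikit-learn's metrics.rand_score offers a
tested equivalent, and the two label lists stand in for a clustering result versus LISA.

```python
# Rand index: fraction of element pairs on which two labelings agree.
from itertools import combinations

def rand_index(labels_x, labels_y):
    pairs = list(combinations(range(len(labels_x)), 2))
    agreements = sum(
        (labels_x[i] == labels_x[j]) == (labels_y[i] == labels_y[j])
        for i, j in pairs
    )  # counts a (same/same pairs) plus b (different/different pairs)
    return agreements / len(pairs)

print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8 for these toy labels
```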
CHAPTER 4
RESULT ANALYSIS

4.1 Introduction:

In this chapter results are analysed and Significance of the result obtained are
discussed.

4.2 Results:

K-means:

As mentioned in the background literature first we need to find the suitable k value for this
scenario.

Figure 4.1 K value


Here, from k = 4 to k = 5 there is not much change in the sum of squares, so we take
K = 4.

Figure 4.2: K-means scattered plot


Figure 4.3: K-means clustering

K-Medoids:

We have already computed the K value for k-means, so we use the same value here, K = 4.
Figure 4.4: K-medoids scattered plot
Figure 4.5: K-medoids results

Agnes:

Dendrogram:

Figure 4.6: AGNES dendrogram

Figure 4.7: AGNES scattered plot


Figure 4.8: AGNES result
DBSCAN:

Figure 4.9: DBSCAN result


CLIQUE:

Figure 4.10: CLIQUE lon vs lat


Figure 4.11: CLIQUE lon vs rape

Figure 4.12: CLIQUE lat vs rape


FUZZY:
In fuzzy clustering we performed soft clustering with 3 clusters (red, green, blue).

Figure 4.13: CLIQUE result


LISA:

Figure 4.14: LISA result


COMPARISON:

The LISA algorithm is specifically designed for aggregated spatial data, so we consider
it the standard across this project.

We compare all the algorithms with LISA using the Jaccard index and the Rand index.
Comparison of results for aggregate data taking LISA as standard:

SL No   Algorithm    Jaccard Index (%)   Rand Index (%)

1       K-means      26.5                68.5
2       K-medoids    23                  68.3
3       AGNES        29                  69.8
4       FUZZY        28.12               68.7
5       DBSCAN       30.4                71.3
Table 4.1: Comparison of results for aggregate data taking LISA as standard
Regarding the CLIQUE, STING and DENCLUE algorithms:

Although the CLIQUE algorithm forms clusters, these clusters have no significance here. It
is more suitable for point data, where clusters need not carry a specific significance.

The STING and DENCLUE algorithms cannot handle aggregated data, so they cannot form any
clustering at all.
CHAPTER 5
CONCLUSION AND FUTURE SCOPE OF WORK

We have presented an overview of clustering algorithms that are useful for spatial
clustering analysis. We categorize them into four categories:

1. Partitioning-based

2. Hierarchical-based

3. Density-based

4. Grid-based

Partitioning methods like k-means and k-medoids make use of a technique called iterative
reallocation to improve clustering quality from an initial solution. As these methods
find clusters that are spherical in shape and similar in size, they are more useful for
applications like facility allocation, where the objective is not to find natural
clusters but to minimize the sum of distances from the data objects to their cluster
centers.

Unlike the partitioning-based clustering algorithms, which reallocate data objects from
one cluster to another in order to improve the clustering quality, hierarchical
clustering algorithms like AGNES fix the membership of a data object once it has been
allocated to a cluster.

Instead of using distance to judge the membership of a data object, density-based
clustering algorithms like DBSCAN make use of the density of data points within a region
to discover clusters. DBSCAN suffers a loss of efficiency for high-dimensional
clustering. This problem is addressed by DENCLUE, which models the overall density of a
point so that the computation can be handled efficiently.

To increase the efficiency of clustering, grid-based clustering methods approximate the
dense regions of the clustering space by quantizing it into a finite number of cells and
identifying cells that contain more than a threshold number of points as dense. The
grid-based approach is usually more efficient than a density-based approach.

To conclude, the hierarchical clustering methods are similar in performance but take
more time compared to the others. Partition-based clustering methods like the k-means
and k-medoids algorithms do not perform well in handling irregularly shaped clusters.
The density-based and grid-based methods are more suitable for handling spatial data,
but when considering time complexity the grid-based methods are preferable.

The problem with LISA is that it requires the frequency of events associated with each
data point, so it is not suitable for point data where each crime is reported
individually, which makes the count at each data point one. From the research papers we
concluded that fuzzy clustering is the best when dealing with point data; fuzzy also
shows decent results on aggregate data.

Partition methods like k-means and k-medoids show decent values for aggregate data. They
are also highly efficient for point data.
REFERENCES

[1]. Neethu C V and Subu Surendran, "Review of Spatial Clustering Methods", SCT College
of Engineering, Trivandrum, India, 2013.
[2]. S. Sivaranjani, S. Sivakumari and M. Aasha, "Crime Prediction and Forecasting in
Tamilnadu using Clustering Approaches", Avinashilingam University, Coimbatore, India,
2016.
[3]. Tony H. Grubesic, "On the Application of Fuzzy Clustering for Crime Hot Spot
Detection".
[4]. Wei Luo, Michael Steptoe, Zheng Chang, Robert Link, Leon Clarke and Ross Maciejewski,
"Impact of Spatial Scales on the Intercomparison of Climate Scenarios".
PROJECT DETAILS

Student Details
Student Name Hanish Woona
Register Number 160907316 Section / Roll No B/42
Email Address woonahanish@gmail.com Phone No (M) 8639004674

Project Details
Project Title        A comparative study of various algorithms to detect clustering in spatial data
Project Duration     4 Months        Date of reporting    31-05-2020

Organization Details
Organization Name    Prasanna School of Public Health
Full postal address  Department of Data Science,
with pin code        Prasanna School of Public Health,
                     MAHE, Manipal
Website address      https://manipal.edu/mu.html
