BACHELOR OF TECHNOLOGY
In
Electronics and Communication Engineering
Submitted by
WOONA HANISH
Reg. No: 160907316
Under the guidance of
EXTERNAL GUIDE
Amitha Puranik
Assistant Professor
Department of Data Science
Prasanna School of Public Health, MAHE, Manipal
INTERNAL GUIDE
Vishnumurthy Kedlaya K
Department of Electronics & Communication
MAY/JUNE 2020
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
Manipal
31-05-2020
CERTIFICATE
This is to certify that the project titled "A comparative study of various algorithms
to detect clustering in spatial data" is a record of the bonafide work done by HANISH
WOONA (Reg. No. 160907316), submitted in partial fulfilment of the requirements for the
award of the Degree of Bachelor of Technology (BTech) in ELECTRONICS AND
COMMUNICATION ENGINEERING of Manipal Institute of Technology, Manipal,
Karnataka (a constituent unit of Manipal Academy of Higher Education), during the
academic year 2019 - 2020.
This project was guided by Amitha Puranik, Department of Data Science, Prasanna School of
Public Health, MAHE, Manipal, and closely monitored by Vishnumurthy Kedlaya K,
ECE, M.I.T, Manipal. The data used for this project was provided by Prasanna School of
Public Health, MAHE, Manipal.
ABSTRACT
Spatial clustering is the process of grouping objects with certain dimensions into groups
such that objects within a group exhibit similar characteristics when compared with those
in other groups. It is an important part of spatial data mining, since it provides insight
into the distribution of the data and the characteristics of spatial clusters.
Data pre-processing is done manually in Excel sheets to match the district names with those
in the shapefile. The shapefile of the districts of India is used to plot the data on a map.
We use a software package called ArcGIS to manipulate the shapefile, which is later used for
the clustering analysis. All results are compared at the end to identify the pros and cons
of each algorithm in various scenarios.
LIST OF TABLES
Chapter 1 INTRODUCTION
1.1 Introduction
1.2 Present day scenario
1.3 Motivation to do the work
1.4 Objective of the work
1.5 Target Specifications
1.6 Project schedule
Chapter 3 METHODOLOGY
3.1 Introduction
3.2 Methodology
This chapter introduces the area of the work. It briefly discusses the present-day scenario
in this area and the motivation for the project, specifies the main and secondary
objectives of the work, and gives the project work schedule.
Spatial data, also known as geospatial data or geographic information, is the data or
information that identifies the geographic location of features and boundaries on
earth, such as natural or constructed features, oceans, and more. Spatial data is usually
stored as coordinates and topology and is data that can be mapped.
Cluster analysis is the process of partitioning a set of data objects (or observations)
into subsets. Each subset is a cluster, such that objects in a cluster are similar to one
another, yet dissimilar to objects in other clusters. The set of clusters resulting from a
cluster analysis can be referred to as a clustering. In this context, different clustering
methods may generate different clusterings on the same data set.
Local Indicators of Spatial Association (LISA) is one of the most widely used techniques
in spatial clustering.
In this project we apply various machine learning clustering techniques to spatial data
and compare the results with those of the LISA technique for aggregate data and with a
fuzzy clustering algorithm for point data.
We analyse the results of the different clustering methods and identify the best method
for the data available to us.
1.6 Project schedule:
2.1 Introduction:
In this chapter we discuss the title of the project, the literature review, the summarized
outcome of the literature review, the general analysis, the mathematical derivations and
the conclusions.
K Medoids algorithm:
AGNES algorithm:
DBSCAN algorithm:
CLIQUE algorithm:
1. Partition the data space and count the number of points that lie inside each cell of the
partition.
2. Identify the subspaces that contain clusters using the Apriori principle.
3. Identify clusters:
a. Determine dense units in all subspaces of interest.
b. Determine connected dense units in all subspaces of interest.
4. Generate a minimal description for the clusters:
a. Determine maximal regions that cover each cluster of connected dense units.
b. Determine a minimal cover for each cluster.
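Steps 1 and 3 of the procedure above can be sketched for a two-dimensional data set as follows. This is only a minimal illustration and not the full CLIQUE algorithm: it skips the Apriori-based subspace search (step 2) and the minimal description (step 4). The parameter names `cell_size` and `min_points` and the sample points are assumptions made for the example.

```python
from collections import Counter, deque

def dense_cells(points, cell_size, min_points):
    """Step 1: bin 2-D points into square grid cells and keep the dense ones."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    return {cell for cell, n in counts.items() if n >= min_points}

def connected_clusters(cells):
    """Step 3b: group dense cells that share an edge, using breadth-first search."""
    unvisited = set(cells)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        cluster, queue = {start}, deque([start])
        while queue:
            cx, cy = queue.popleft()
            for neighbour in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if neighbour in unvisited:
                    unvisited.remove(neighbour)
                    cluster.add(neighbour)
                    queue.append(neighbour)
        clusters.append(cluster)
    return clusters

# Hypothetical sample points, grouped by the unit cell they fall in.
points = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9),   # cell (0, 0)
          (1.2, 0.5), (1.7, 0.1),               # cell (1, 0)
          (5.5, 5.5), (5.1, 5.9),               # cell (5, 5)
          (3.3, 8.8)]                           # cell (3, 8), too sparse
dense = dense_cells(points, cell_size=1.0, min_points=2)
clusters = connected_clusters(dense)
```

With these sample points, cells (0, 0) and (1, 0) are adjacent and merge into one cluster, cell (5, 5) forms a second cluster, and the isolated point in cell (3, 8) is discarded as sparse.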
LISA algorithm:
If a user has information on the location of individual events, then it is better to utilize
that information with the point statistics. The individual-level information will contain
all the uniqueness of the events.
However, sometimes it is not possible to analyze data at the individual level. The
user may need to aggregate the individual data points to spatial areas in order to
compare the events to data that are only obtained for zones, such as census data, or to
model environmental correlates of the data points or may find that individual data are
not available. In this case, the individual data points are allocated to zones by, first,
spatially assigning them to the zones in which they fall and, second, counting the
number of points assigned to each zone. A user can do this with a GIS program or
with the “Assign Primary points to Secondary Points” routine.
In this case, the zone becomes the unit of analysis instead of the individual data
points. All the incidents are assigned to a single geographical coordinate, typically
the centroid of the zone, and the number of incidents in the zone becomes an attribute
of the zone.
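The allocation of individual points to zones described above can be sketched as follows. For simplicity the sketch assumes that every zone is an axis-aligned rectangle; real district boundaries are polygons and would need a point-in-polygon test from a GIS package. The zone names and coordinates are hypothetical.

```python
def assign_points_to_zones(points, zones):
    """Count the points falling in each zone and return {zone: (centroid, count)}."""
    counts = {name: 0 for name in zones}
    for x, y in points:
        for name, (xmin, ymin, xmax, ymax) in zones.items():
            if xmin <= x < xmax and ymin <= y < ymax:
                counts[name] += 1
                break  # each point is assigned to at most one zone
    result = {}
    for name, (xmin, ymin, xmax, ymax) in zones.items():
        # The zone is represented by a single coordinate: its centroid.
        centroid = ((xmin + xmax) / 2, (ymin + ymax) / 2)
        result[name] = (centroid, counts[name])
    return result

# Hypothetical zones given as (xmin, ymin, xmax, ymax) and incident locations.
zones = {"zone_a": (0, 0, 2, 2), "zone_b": (2, 0, 4, 2)}
incidents = [(0.5, 0.5), (1.5, 1.0), (2.5, 0.2), (9.0, 9.0)]
aggregated = assign_points_to_zones(incidents, zones)
```

The incident at (9.0, 9.0) lies outside both zones and is not counted, which mirrors the loss of individual-level detail once the zone becomes the unit of analysis.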
Thus, the distance between two zones is a single value for all the points in those zones,
whereas the distances between individual events vary much more.
Further, zones have attributes which are properties of the zone, not of the individual
events. The attribute can be a count or a continuous variable for a distributional
property of the zone.
In Moran’s initial formulation, the weight variable, Wij, was a contiguity matrix: if
zone j is adjacent to zone i, the interaction receives a weight of 1; otherwise, the
interaction receives a weight of 0. Cliff and Ord (1973) generalized these definitions
to include any type of weight. In more current use, Wij is a distance-based weight,
the inverse of the distance between locations i and j (1/dij). CrimeStat uses this
interpretation. Essentially, it is a weighted Moran’s I where the weight is an inverse
distance.
Unlike a correlation coefficient, the theoretical value of the index in the absence of
spatial dependence is not 0 but a negative value very close to 0,
E(I) = -1/(N - 1), where N is the number of zones.
Values of “I” above the theoretical mean, E(I), indicate positive spatial
autocorrelation while values of “I” below the theoretical mean indicate negative
spatial autocorrelation.
CrimeStat calculates the weighted Moran’s I using the formula above. However,
there is one problem with this formula that can lead to unreliable results. The distance
weight between two locations, Wij, is defined as the reciprocal of the distance
between the two points, consistent with Moran’s original formulation:
Wij = 1/dij
As dij becomes small, Wij becomes very large, approaching infinity as the
distance between the points approaches 0. If the two zones are next to each other,
as is the case for two adjacent blocks, for example, then the pair of
observations has a very high weight, sufficient to distort the “I” value for the
entire sample. Further, there is a scale problem that alters the value of the weight. If
the zones are police precincts, for example, then the minimum distance between
precincts will be much larger than the minimum distance between smaller
geographical units, such as blocks. We need to take these scales into account.
CrimeStat includes an adjustment for small distances so that the maximum weight can
never be greater than 1.0. The adjustment scales distances to one mile, which is a
typical distance unit in the measurement of crime incidents. When the small distance
adjustment is turned on, the minimal distance is automatically scaled to be one mile.
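The formula used is Wij = 1/(1 + dij), with dij measured in miles, so that the weight
equals 1.0 at zero distance and decreases as the distance grows.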
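The weighted index discussed in this section can be sketched in a few lines. The sketch uses the standard global Moran's I, I = (N / S0) * sum over i and j of Wij (xi - m)(xj - m), divided by the sum over i of (xi - m)^2, where m is the mean of the attribute and S0 is the sum of all weights. It is not CrimeStat's implementation. The adjusted weight 1/(1 + dij), with distances in miles, is an assumed form chosen only because it matches the description above (the weight never exceeds 1.0), and the coordinates and counts are hypothetical.

```python
import math

def morans_i(coords, values, small_distance_adjustment=False):
    """Global Moran's I with inverse-distance weights between zone centroids."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0     # sum of w_ij * dev_i * dev_j over all ordered pairs i != j
    w_sum = 0.0   # S0, the sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            if small_distance_adjustment:
                w = 1.0 / (1.0 + d)   # assumed adjusted form: never exceeds 1.0
            else:
                w = 1.0 / d           # plain inverse distance: unbounded as d -> 0
            num += w * dev[i] * dev[j]
            w_sum += w
    denom = sum(x * x for x in dev)
    return (n / w_sum) * (num / denom)

def expected_i(n):
    """Theoretical mean of I when there is no spatial dependence."""
    return -1.0 / (n - 1)

# Four hypothetical zone centroids on a line, one mile apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = [10, 9, 1, 2]      # similar counts sit next to each other
alternating = [10, 1, 10, 1]   # neighbouring counts are dissimilar

i_clustered = morans_i(coords, clustered)
i_alternating = morans_i(coords, alternating)
i_adjusted = morans_i(coords, clustered, small_distance_adjustment=True)
```

For these four zones E(I) is -1/3. The clustered arrangement gives a value above E(I) and the alternating arrangement a value below it, which is how positive and negative spatial autocorrelation are read from the index.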
CHAPTER 3
METHODOLOGY
3.1 Introduction:
3.2 Methodology:
Data Preprocessing
Step I
1. Data Pre-processing:
We use district-level crime data on reported rape cases against women for the
year 2013. To be mapped to the shapefile, the raw data must contain the same number of
objects as the shapefile, with the same object names.
This requires some manual cross-checking of the object names (district names).
For some districts the data was split into sub-district records. All such sub-district
records must be re-joined into a single district record.
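The project performs this step manually in Excel. Purely as an illustration, the same re-joining and name cross-check could be sketched in code as follows. The district names, the mapping `SUB_TO_DISTRICT` and the counts are hypothetical placeholders, not values from the project data.

```python
from collections import defaultdict

# Hypothetical mapping from sub-district labels in the raw sheet to the
# district name used in the shapefile.
SUB_TO_DISTRICT = {
    "District A (Rural)": "District A",
    "District A (Urban)": "District A",
}

def normalise(name):
    """Collapse stray whitespace and unify letter case so names compare reliably."""
    return " ".join(name.split()).title()

def merge_sub_districts(rows):
    """Sum the counts of all rows that belong to the same district."""
    totals = defaultdict(int)
    for name, count in rows:
        clean = normalise(name)
        totals[SUB_TO_DISTRICT.get(clean, clean)] += count
    return dict(totals)

def unmatched(totals, shapefile_names):
    """District names in the data that have no counterpart in the shapefile."""
    return sorted(set(totals) - set(shapefile_names))

rows = [("District A (Rural)", 3), ("  district a (urban) ", 5),
        ("District B", 2), ("District C", 1)]
totals = merge_sub_districts(rows)
missing = unmatched(totals, ["District A", "District B"])
```

Any name returned by `unmatched` would still need the manual check described above, since a mismatch may be a spelling variant rather than a genuinely missing district.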
2. Mapping the data into shape file:
A shapefile is a geospatial vector data format for geographic information
system (GIS) software.
The data from the Excel sheet is mapped onto the Indian districts shapefile.
Mapping the data onto the shapefile is an important step, since the result is later
used to perform the clustering analysis.
Jaccard Index:
The Jaccard similarity index (sometimes called the Jaccard similarity coefficient)
compares the members of two sets to see which members are shared and which are
distinct. It is the size of the intersection divided by the size of the union,
J(X,Y) = |X ∩ Y| / |X ∪ Y|, and ranges from 0% to 100%. The higher the percentage,
the more similar the two sets: two sets that share all members are 100% similar.
Jaccard Distance:
A similar statistic, the Jaccard distance, is a measure of how dissimilar two sets are. It
is the complement of the Jaccard index and can be found by subtracting the Jaccard
Index from 100%.
D(X,Y) = 1 – J(X,Y)
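A minimal sketch of both measures, taking J(X,Y) as the size of the intersection divided by the size of the union. The district labels are hypothetical, and treating two empty sets as identical is a convention assumed here.

```python
def jaccard_index(x, y):
    """Share of members common to both sets, between 0.0 and 1.0."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumed convention: two empty sets are identical
    return len(x & y) / len(x | y)

def jaccard_distance(x, y):
    """Complement of the Jaccard index: D(X,Y) = 1 - J(X,Y)."""
    return 1.0 - jaccard_index(x, y)

# Hypothetical sets of districts flagged as hot spots by two methods.
lisa_hotspots = {"D1", "D2", "D3", "D4"}
kmeans_hotspots = {"D3", "D4", "D5"}

similarity = jaccard_index(lisa_hotspots, kmeans_hotspots)
dissimilarity = jaccard_distance(lisa_hotspots, kmeans_hotspots)
```

Here the two sets share 2 of 5 distinct districts, so the index is 40% and the distance is 60%.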
Rand Index:
The Rand index or Rand measure in statistics, and in particular in data clustering, is a
measure of the similarity between two data clusterings. A form of the Rand index that is
adjusted for the chance grouping of elements is known as the adjusted Rand index.
From a mathematical standpoint, the Rand index is related to accuracy,
but is applicable even when class labels are not used.
Given a set S of n elements and two partitions X and Y of S, define:
a, the number of pairs of elements in S that are in the same subset in X and in the same
subset in Y.
b, the number of pairs of elements in S that are in different subsets in X and in
different subsets in Y.
c, the number of pairs of elements in S that are in the same subset in X and in
different subsets in Y.
d, the number of pairs of elements in S that are in different subsets in X and in the
same subset in Y.
The Rand index is
R = (a + b) / (a + b + c + d)
Here a + b can be considered as the number of agreements between X and Y and c + d as the
number of disagreements between X and Y.
Since the denominator is the total number of pairs, n(n - 1)/2, the Rand index represents
the frequency of occurrence of agreements over the total pairs, or the probability that X
and Y will agree on a randomly chosen pair.
Similarly, one can also view the Rand index as a measure of the percentage of correct
decisions made by the algorithm. It can be computed using the following formula:
RI = (TP + TN) / (TP + FP + FN + TN)
where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives.
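The pair counts a, b, c and d described above, and the Rand index R = (a + b) / (a + b + c + d) built from them, can be computed directly from two label lists. This is a minimal sketch and the two labelings are hypothetical.

```python
from itertools import combinations

def pair_counts(labels_x, labels_y):
    """Return (a, b, c, d) over all unordered pairs of elements."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1      # together in X, together in Y
        elif not same_x and not same_y:
            b += 1      # apart in X, apart in Y
        elif same_x:
            c += 1      # together in X, apart in Y
        else:
            d += 1      # apart in X, together in Y
    return a, b, c, d

def rand_index(labels_x, labels_y):
    """Fraction of pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(labels_x, labels_y)
    return (a + b) / (a + b + c + d)

# Hypothetical cluster labels for five districts from two methods.
labels_x = [0, 0, 1, 1, 2]
labels_y = [0, 0, 1, 2, 2]
counts = pair_counts(labels_x, labels_y)
score = rand_index(labels_x, labels_y)
```

For these labels there are 10 pairs, of which 8 are agreements (a = 1, b = 7) and 2 are disagreements (c = 1, d = 1), giving a Rand index of 0.8.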
CHAPTER 4
RESULT ANALYSIS
4.1 Introduction:
In this chapter the results are analysed and the significance of the results obtained is
discussed.
4.2 Results:
K-means:
As mentioned in the background literature, we first need to find a suitable value of k for
this scenario.
K-Medoids:
We have already computed the value of k for k-means, so we use the same value here, k = 4.
Figure 4.4: K-medoids scatter plot
Figure 4.5: K-medoids results
AGNES:
Dendrogram
LISA is designed specifically for aggregated spatial data, so we treat it as the
reference standard throughout this project.
We compare all algorithms with LISA using the Jaccard index and the Rand index.
Comparison of results for aggregate data taking LISA as standard:
Although the CLIQUE algorithm forms clusters, these clusters have no significance. It is
more suitable for point data, where clusters do not need to have a specific significance.
The STING and DENCLUE algorithms cannot handle aggregated data, so they cannot form any
clusters.
CHAPTER 5
CONCLUSION AND FUTURE SCOPE OF WORK
We have presented an overview of clustering algorithms that are useful for spatial
clustering analysis. We group them into four categories:
1. Partitioning-based
2. Hierarchical-based
3. Density-based
4. Grid-based
Partitioning methods like k-means and k-medoids use a technique called iterative
reallocation to improve clustering quality from an initial solution. As these methods find
clusters that are spherical in shape and similar in size, they are more useful for
applications like facility allocation, where the objective is not to find natural clusters
but to minimize the sum of distances from the data objects to their cluster centers.
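The iterative reallocation idea can be illustrated with a small k-means sketch on one-dimensional values. This is not the implementation used in the project. The initial centers are supplied by the caller so that the run is deterministic, and the sample values are hypothetical. Note that k-means takes the mean of each cluster as its center, which minimizes squared distances.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd-style k-means on scalars, starting from the given initial centers."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Reallocation step: assign every value to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                  for v in values]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # no center moved, so the allocation is stable
        centers = new_centers
    return labels, centers

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels, centers = kmeans_1d(values, centers=[0.0, 5.0])
```

Starting from the poor initial centers 0.0 and 5.0, the value 3.0 is first assigned to the second cluster and is reallocated to the first cluster on the next pass, after which the centers settle at 2.0 and 11.0.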
Unlike the partitioning-based clustering algorithms, which reallocate data objects from one
cluster to another in order to improve the clustering quality, hierarchical clustering
algorithms like AGNES fix the membership of a data object once it has been allocated to a
cluster.
Instead of using distance to judge the membership of a data object, density-based clustering
algorithms like DBSCAN use the density of data points within a region to discover
clusters. DBSCAN loses efficiency in high-dimensional clustering. This
problem is addressed by DENCLUE, which models the overall density at a point to handle the
computation efficiently.
To increase the efficiency of clustering, grid-based clustering methods approximate the dense
regions of the clustering space by quantizing it into a finite number of cells and identifying
cells that contain more than a threshold number of points as dense. A grid-based approach is
usually more efficient than a density-based approach.
To conclude, the hierarchical clustering methods are similar in performance but take more
time than the others. Partition-based clustering methods like the k-means and k-medoids
algorithms do not handle irregularly shaped clusters well. The density-based and grid-based
methods are more suitable for handling spatial data, and when time complexity is considered
the grid-based methods are preferable.
The problem with LISA is that it requires a frequency of events associated with each data
point, so it is not suitable for point data where each crime is reported individually, which
makes the count at every data point equal to one. From the research papers we concluded that
fuzzy clustering is the best choice when dealing with point data. Fuzzy clustering also shows
decent results on aggregate data.
Partition methods like k-means and k-medoids show decent values for aggregate data and are
highly efficient for point data.
REFERENCES
[1]. Neethu C V and Mr. Subu Surendra, “Review of Spatial Clustering Methods”, SCT
College of Engineering, Trivandrum, India, 2013, 24.
[2]. S. Sivaranjani, Dr. S. Sivakumari and Aasha. M, “Crime Prediction and Forecasting in
Tamilnadu using Clustering Approaches”, Avinashilingam University, Coimbatore,
India, 2016, 6.
[3]. Tony H. Grubesic, “On The Application of Fuzzy Clustering for Crime Hot Spot
Detection”.
[4]. Wei Luo, Michael Steptoe, Zheng Chang, Robert Link, Leon Clarke and Ross Maciejewski,
“Impact of Spatial Scales on the Intercomparison of Climate Scenarios”.
PROJECT DETAILS
Student Details
Student Name: Hanish Woona
Register Number: 160907316
Section / Roll No: B/42
Email Address: woonahanish@gmail.com
Phone No (M): 8639004674
Project Details
Project Title: A comparative study of various algorithms to detect clustering in spatial data
Project Duration: 4 Months
Date of reporting: 31-05-2020
Organization Details
Organization Name: Prasanna School of Public Health
Full postal address with pin code: Department of Data Science, Prasanna School of Public Health, MAHE, Manipal
Website address: https://manipal.edu/mu.html