
Project 2

Clustering Algorithms

Team Members
Chaitanya Vedurupaka (50205782)
Anirudh Yellapragada (50206970)
k-Means:
k-Means is an unsupervised learning technique that is used to solve the clustering
problem.
Below are the implementation details of the algorithm (a short code sketch follows the list):
1. Initially, k centroids are chosen randomly from the given input data. For the purpose of
this project demo, we have also given an option to take the initial centroids from the user.
2. For each input data point, we measure the distance between the data point and each of the
k chosen centroids. The index of the centroid that is at the minimum distance from the data
point is assigned as the cluster index of that data point.
3. Once all the data points have been assigned a cluster index, new centroids are calculated
as the mean of all data points belonging to the same cluster index.
4. This process continues until the centroids converge, i.e. until the difference between the
old and the new centroids becomes zero.
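A minimal Java sketch of this loop is shown below. The class and method names (KMeansSketch, nearest) are illustrative rather than our exact code, and the random initialization stands in for the user-supplied centroids that our demo also accepts.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    // Step 2: index of the centroid closest (in squared Euclidean distance) to the point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Steps 1-4: pick k random centroids, assign points, recompute means, repeat until converged.
    static int[] kMeans(double[][] data, int k) {
        int dim = data[0].length;
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = data[rnd.nextInt(data.length)].clone();
        }
        int[] labels = new int[data.length];
        boolean changed = true;
        for (int iter = 0; changed && iter < 500; iter++) {   // safety cap on iterations
            // Step 2: assign every data point to its closest centroid.
            for (int i = 0; i < data.length; i++) {
                labels[i] = nearest(data[i], centroids);
            }
            // Step 3: recompute each centroid as the mean of the points assigned to it.
            double[][] updated = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < dim; j++) {
                    updated[labels[i]][j] += data[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    updated[c] = centroids[c].clone(); // keep an empty cluster's centroid in place
                } else {
                    for (int j = 0; j < dim; j++) {
                        updated[c][j] /= counts[c];
                    }
                }
            }
            // Step 4: stop once the old and new centroids are identical.
            changed = !Arrays.deepEquals(centroids, updated);
            centroids = updated;
        }
        return labels;
    }
}
```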

After the complete run of the algorithm on the input data, we compare the ground-truth
clusters and the algorithmic clusters using the Jaccard coefficient (a sketch of this
computation is given below).
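As a reference for how this comparison can be computed, the sketch below counts, over all pairs of data points, the pairs grouped together in both clusterings (M11), only in the ground truth (M10), and only in our clustering (M01). The class and method names are illustrative, not our exact code.

```java
public class JaccardSketch {

    // Jaccard coefficient = |M11| / (|M11| + |M10| + |M01|), computed over all pairs of points.
    static double jaccard(int[] groundTruth, int[] predicted) {
        long m11 = 0, m10 = 0, m01 = 0;
        for (int i = 0; i < groundTruth.length; i++) {
            for (int j = i + 1; j < groundTruth.length; j++) {
                boolean sameTruth = groundTruth[i] == groundTruth[j];
                boolean samePredicted = predicted[i] == predicted[j];
                if (sameTruth && samePredicted) m11++;   // together in both clusterings
                else if (sameTruth) m10++;               // together only in the ground truth
                else if (samePredicted) m01++;           // together only in our clustering
            }
        }
        return (double) m11 / (m11 + m10 + m01);
    }
}
```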

Visualization for Cho dataset:


The initial centroids chosen for the above plots are from gene-ids (5, 30, 35, 101, 135).
The total number of iterations taken to converge is 21.
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Cho dataset is 0.4066834.

Visualization for Iyer dataset:


The initial centroids chosen for the above plots are from gene-ids (50, 100, 150, 200, 250).
The total number of iterations taken to converge is 34.
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Iyer dataset is 0.3493445.

Pros of k-Means Clustering:

• Easy to implement.
• Works well when the clusters have similar density.

Cons of k-Means Clustering:
• The number of clusters is difficult to predict in advance.
• The initial centroids have a strong impact on the final results.
• k-Means has problems when clusters are of differing sizes and irregular shapes.
• Sensitive to scale.
Findings from the experiment:
• k-Means gives the best Jaccard coefficient of the three clustering techniques we
implemented, i.e. compared to hierarchical and DBSCAN clustering.
• For small values of k, k-Means appears to be computationally faster than hierarchical
clustering.
• From the plots, we can observe that k-Means produces tighter clusters than hierarchical
clustering.
• From the above plots, we can also notice that k-Means fails to detect outliers and instead
absorbs them into one of the clusters.
• The initial centroids affect the Jaccard coefficient, with values ranging from 0.3 to 0.45.
• k-Means worked well on the Cho dataset, which can be attributed to the clusters in the Cho
dataset being more regular than those in the Iyer dataset.
k-Means using Hadoop:
Hadoop Map-Reduce is a framework for the parallel processing of large datasets.
Map-Reduce refers to the two different tasks that Hadoop performs.

The Hadoop framework splits the given input dataset into a number of blocks, each of which is
processed by a mapper.
1) Map: The mapper task takes the input data and emits intermediate <key, value> pairs, where
the key and value depend on the algorithm we are trying to implement. The Hadoop Map-Reduce
framework creates one map task for each of the input blocks.
The output from the mappers then goes through the other phases of the framework (the combine,
partition, sort, and shuffle phases), and the intermediate results are passed as input to the
reducer.
2) Reduce: The task of the reducer is to collect the data points sharing the same key and
convert them into a smaller set of values, in a way that again depends on the algorithm we
are trying to implement.

In our implementation of the k-Means algorithm, below are the tasks performed by the Hadoop
Map-Reduce framework at the different phases (a code sketch of the mapper and reducer follows
the Reducer description). We have used Java for Hadoop Map-Reduce, and we maintain a global
map that has the old centroid as the key and the new centroid as the value.
1) Driver:
• The map is initialized with the initial centroids from the "centroids.txt" file.
• A job configuration is created and submitted by the job client to the resource manager of
the Hadoop framework, which distributes the job configuration to the different slave nodes of
the architecture (a sketch of this loop is given after the list).
• The job configuration includes details of the mapper, combiner, and reducer classes.
• Once the mapper and reducer jobs are done, we check whether the centroids have converged.
• Once the centroids have converged, the map-reduce task is done. The output from the reducer
is used to calculate the Jaccard coefficient.
• For visualizing the data, we write the cluster index of each data point into a file that is
used by a Python program for plotting.
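The sketch below shows how such a driver loop can look. The Job configuration calls are the standard org.apache.hadoop.mapreduce API; the mapper and reducer classes are the ones sketched after the Reducer description below, the combiner is omitted for brevity, and loadCentroids()/centroidsConverged() are placeholders for our actual file handling and convergence check.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        loadCentroids("centroids.txt");                      // initialize the global centroid map
        int iteration = 0;
        boolean converged = false;
        while (!converged) {
            Job job = Job.getInstance(conf, "kmeans-iteration-" + iteration);
            job.setJarByClass(KMeansDriverSketch.class);
            job.setMapperClass(KMeansMapReduceSketch.KMeansMapper.class);
            job.setReducerClass(KMeansMapReduceSketch.KMeansReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "-" + iteration));
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("iteration " + iteration + " failed");
            }
            // Compare old and new centroids; stop once the difference is zero.
            converged = centroidsConverged(args[1] + "-" + iteration);
            iteration++;
        }
    }

    // Placeholder: reads the initial centroids into the global map.
    static void loadCentroids(String file) { }

    // Placeholder: parses the reducer output and compares it with the previous centroids.
    static boolean centroidsConverged(String outputDir) { return true; }
}
```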
2) Mapper:
• The input to the mapper is a single data point from the input data file.
• We use the new centroids that are stored as values in the global map.
• The distance between the current data point and each of the centroids is measured.
• The output of the mapper is the centroid that is at the minimum distance from the data
point, together with the data point itself.
3) Combiner:
• The input to the combiner is a key/value pair: the centroid is the key, and the value
contains the list of all data points belonging to the centroid given in the key.
• We combine all the data points that share the same centroid. We construct a string that
contains comma-separated gene-ids, followed by comma-separated ground-truth values, followed
by the average attribute values of all the data points belonging to the same centroid.
• The output of the combiner has the same form as the output of the mapper, except that we
send partially calculated centroid values, thereby reducing the work of the reducer.
• This way we are trying to improve the efficiency, i.e. the running time.
4) Reducer:
• The input to the reducer is a key/value pair: the centroid is the key, and the value
contains the list of all data points belonging to the centroid given in the key. The value
here may also be the output of the combiner, in which case we parse it to properly calculate
the new centroid from the partial centroid values passed by the combiner.
• The new centroid is calculated from all the data points received, and the global map is
updated accordingly.
• The new centroid values are written to the output file along with all gene-ids associated
with the new centroids.
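The mapper and reducer can be sketched as follows. The class names and the static CENTROIDS field are illustrative; our real implementation also carries gene-ids and ground-truth labels in the values, understands the combiner's partially aggregated strings, and ships the centroids to the mappers through the job configuration rather than a static field (a static field only works in local mode).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansMapReduceSketch {

    // Current centroids shared with the mappers; placeholder for the global map in our code.
    static double[][] CENTROIDS;

    public static class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // One input record = one data point; emit <index of nearest centroid, point>.
            double[] point = parsePoint(line.toString());
            context.write(new IntWritable(nearestCentroid(point)), line);
        }
    }

    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable centroid, Iterable<Text> points, Context context)
                throws IOException, InterruptedException {
            // The new centroid is the mean of all points that arrived under this key.
            double[] sum = null;
            int count = 0;
            for (Text t : points) {
                double[] p = parsePoint(t.toString());
                if (sum == null) sum = new double[p.length];
                for (int j = 0; j < p.length; j++) sum[j] += p[j];
                count++;
            }
            StringBuilder out = new StringBuilder();
            for (int j = 0; j < sum.length; j++) {
                if (j > 0) out.append(',');
                out.append(sum[j] / count);
            }
            context.write(centroid, new Text(out.toString()));
        }
    }

    // Parses one whitespace-separated line of attribute values (a simplification of our format).
    static double[] parsePoint(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] p = new double[parts.length];
        for (int j = 0; j < parts.length; j++) p[j] = Double.parseDouble(parts[j]);
        return p;
    }

    // Index of the centroid with the smallest squared Euclidean distance to the point.
    static int nearestCentroid(double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < CENTROIDS.length; c++) {
            double d = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - CENTROIDS[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```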

Visualization for Cho dataset:


The initial centroids chosen for the above plots are from gene-ids (30, 60, 90, 120, 200).
The total number of iterations taken to converge is 17.
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Cho dataset is 0.3797285.

Visualization for Iyer dataset:


The initial centroids chosen for the above plots are from gene-ids (60, 120, 180, 240, 300).
The total number of iterations taken to converge is 27.
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Iyer dataset is 0.3000499.

Pros of Hadoop k-Means Clustering:

• As we are performing k-Means in the Hadoop Map-Reduce framework, the execution happens in
parallel, thereby decreasing the running time. This is most clearly observed on large
datasets.
• We have scope to further increase the efficiency of the algorithm by customizing the
various phases such as partitioning, sorting, and combining.
• The Map-Reduce based solution of Hadoop is, in general, highly scalable, flexible,
cost-effective, and secure.

Cons of Hadoop k-Means Clustering:

• In general, the Hadoop Map-Reduce framework is not efficient for iterative processing.
Though it can perform well compared with the normal k-Means implementation, it is not
efficient enough, since k-Means is an iterative process.
• The main disadvantage of Map-Reduce is that it is not always easy to implement every
algorithm with it.
• In real-time processing, where intermediate processes need to talk to other processes,
Map-Reduce does not do a great job, as the mapper and reducer tasks are isolated and
distinct.
Findings from the experiment:
• Hadoop k-Means gives the best Jaccard coefficient compared to the other two clustering
techniques, i.e. hierarchical and DBSCAN clustering.
• Hadoop k-Means gives essentially the same Jaccard coefficient as the normal k-Means
clustering (the small differences reported above come from using different initial
centroids).
• For small values of k, k-Means appears to be computationally faster than hierarchical
clustering.
• Use of customized combiner, partitioner, and sort classes may further improve the running
time of the algorithm.
• For the given datasets, the non-parallel k-Means implementation ran slightly faster than
the Hadoop k-Means.

Hierarchical Agglomerative Clustering:


Hierarchical clustering creates clusters that have a predetermined ordering from top to
bottom. There are two types of hierarchical clustering: divisive and agglomerative. In the
agglomerative method, we start by assigning each observation to its own cluster and then
repeatedly merge clusters. Following are the implementation details for hierarchical
agglomerative clustering with single link (Min); a code sketch follows the list:

1. Compute the distance matrix by calculating the Euclidean distance between all the data
points.
2. Initially, each data point is a separate cluster.
3. Store all the clusters.
4. Using the distance matrix computed in step 1, find the two closest clusters under single
link, i.e. the closest pair of data objects belonging to different clusters.
5. Merge the two closest clusters found in step 4.
6. Update the distance matrix so that the cluster merged in step 5 has its distances to the
other clusters set to the minimum of the distances of the two merged clusters. Update the
distances of the non-merged clusters to the new merged cluster as well.
7. Store the updated clusters.
8. Go to step 4 until a single cluster remains.
9. Build the ground truth matrix using the true labels.
10. Retrieve the stored cluster data based on the number of clusters you desire (e.g. if you
assume there are 10 clusters, retrieve the stored cluster data which has 10 clusters).
11. Build the cluster truth matrix using the cluster data obtained in step 10.
12. Compute the Jaccard coefficient on the matrices obtained in steps 9 and 11 using the
formula below:

Jaccard coefficient = |M11| / (|M11| + |M10| + |M01|)
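A compact sketch of steps 2 to 8 is shown below. The names are illustrative; instead of updating the distance matrix in place as in step 6, this sketch recomputes the single-link distance between the current clusters, and it stops at a requested number of clusters rather than storing every intermediate clustering.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleLinkHcaSketch {

    static List<List<Integer>> singleLink(double[][] dist, int targetClusters) {
        int n = dist.length;
        // Step 2: every point starts in its own cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        // Steps 4-8: repeatedly merge the two clusters with the smallest single-link distance.
        while (clusters.size() > targetClusters) {
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = minDistance(clusters.get(a), clusters.get(b), dist);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    // Single link: the distance between two clusters is the closest pair of their members.
    static double minDistance(List<Integer> a, List<Integer> b, double[][] dist) {
        double min = Double.MAX_VALUE;
        for (int i : a)
            for (int j : b)
                min = Math.min(min, dist[i][j]);
        return min;
    }
}
```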
Visualization for 5 clusters for the ground truth labels:

Visualization for 5 clusters for the cluster truth labels:

We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Cho dataset is 0.2283949.
Visualization for 10 clusters for the ground truth labels:

Visualization for 10 clusters for the cluster truth labels:

We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Iyer dataset is 0.15486.
Pros of Hierarchical Agglomerative Clustering:
• Any desired number of clusters can be obtained by cutting the dendrogram at the proper
level.

Cons of Hierarchical Agglomerative Clustering:

• Sensitive to noise and outliers.
• Difficulty handling clusters of different sizes and irregular shapes.
• Tends to break large clusters.
• Intermediate decisions cannot be undone.
Findings from the experiment:
• The Jaccard coefficients obtained from HAC are lower than the values obtained through
k-Means and DBSCAN.
• HAC with single link is sensitive to noise and outliers, which can be seen from the plots
on the Iyer dataset.
• From the above plots, we can clearly notice that HAC could not identify the clusters
properly, which can be attributed to the fact that the true clusters are irregular in shape.
• The Jaccard coefficient on the Cho dataset is higher than on the Iyer dataset because the
clusters in the Iyer dataset are more irregular.
• Another reason for the lower Jaccard coefficient on the Iyer dataset is its outliers, which
have impacted the Jaccard coefficient.

Density Based Clustering:


DBSCAN is a data clustering algorithm: given a set of points in some space, it groups
together points that are closely packed (points with many nearby neighbors), marking as
outliers points that lie alone in low-density regions. Following are the implementation
details for density-based clustering (DBSCAN); a code sketch follows the list:

1. Compute the distance matrix by calculating the Euclidean distance between all the data
points.
2. Initialize clusterSize to 0.
3. For each data point:
a. Check whether it has been visited.
b. If visited, go to step 3 for the next point.
c. Else, mark it as visited and compute its neighborhood points within the radius є.
i. If the number of neighborhood points is less than MinPts, mark it as noise and
go to step 3 for the next point.
ii. Else, increment clusterSize by 1, add the current point to the new cluster, and add
all the neighborhood points into a queue.
iii. Pop each value from the queue and check whether it has been visited. If not
visited, mark it as visited and compute its neighborhood points within the radius є.
If the number of points is greater than or equal to MinPts, add all the neighborhood
points to the queue.
iv. If the popped point has not been added to any cluster, add it to the current
cluster.
v. Repeat steps iii and iv until the queue is empty.
4. Build the ground truth matrix using the true labels.
5. Build the cluster truth matrix using the cluster data obtained in step 3.
6. Compute the Jaccard coefficient on the matrices obtained in steps 4 and 5 using the
formula below:
Jaccard coefficient = |M11| / (|M11| + |M10| + |M01|)
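A condensed sketch of step 3 is shown below, assuming the pairwise Euclidean distance matrix from step 1 has already been computed; identifiers such as regionQuery are illustrative rather than our exact code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class DbscanSketch {

    static int[] dbscan(double[][] dist, double eps, int minPts) {
        int n = dist.length;
        int[] labels = new int[n];            // 0 = unvisited, -1 = noise, >0 = cluster id
        int clusterId = 0;
        for (int i = 0; i < n; i++) {
            if (labels[i] != 0) continue;                  // steps 3a/3b: skip visited points
            List<Integer> neighbors = regionQuery(dist, i, eps);
            if (neighbors.size() < minPts) {               // step 3c-i: too sparse -> noise
                labels[i] = -1;
                continue;
            }
            clusterId++;                                   // step 3c-ii: start a new cluster
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {                     // steps 3c-iii to 3c-v: expand it
                int p = queue.poll();
                if (labels[p] == 0) {
                    List<Integer> pn = regionQuery(dist, p, eps);
                    if (pn.size() >= minPts) queue.addAll(pn);
                }
                if (labels[p] <= 0) labels[p] = clusterId; // claim unassigned or noise points
            }
        }
        return labels;
    }

    // All points within eps of point i, according to the precomputed distance matrix (step 1).
    static List<Integer> regionQuery(double[][] dist, int i, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int j = 0; j < dist.length; j++)
            if (dist[i][j] <= eps) neighbors.add(j);
        return neighbors;
    }
}
```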

Visualization for Cho dataset with є = 0.98 and MinPts = 4 for the ground truth labels:

Visualization for Cho dataset with є = 0.98 and MinPts = 4 for the cluster truth labels:

We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Cho dataset is 0.2088777.
Visualization for Iyer dataset with є = 0.98 and MinPts = 4 for the ground truth labels:

Visualization for Iyer dataset with є = 0.98 and MinPts = 4 for the cluster truth labels:

We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the
Iyer dataset is 0.2826085.
Pros of DBSCAN Clustering:
1. DBSCAN does not require one to specify the number of clusters in the data a priori, as
opposed to k-means.
2. DBSCAN can find arbitrarily shaped clusters.
3. Resistant to noise.
Cons of DBSCAN Clustering:
1. Cannot handle varying densities.
2. Sensitive to parameters, i.e. it is hard to determine the correct set of parameters.

Findings from the experiment:

1. The Jaccard coefficients obtained with DBSCAN are lower than those of k-Means, which can
be due to the data being less dense.
2. Different values of є and MinPts give different numbers of clusters.
3. If є is fixed and MinPts is increased, the number of clusters tends to decrease and the
number of outliers tends to increase, because the number of points in each neighborhood stays
constant while the MinPts threshold is raised.
4. If MinPts is fixed and є is increased, the number of outliers tends to decrease and the
number of clusters tends to decrease, because the number of points in each neighborhood
increases.
5. If є is fixed and MinPts is decreased, or if MinPts is fixed and є is decreased, the
number of clusters and the number of outliers can increase or decrease depending on the
values of є and MinPts.
6. We are fixing є and MinPts, which means that we are fixing the density; hence this method
cannot handle varying densities.
7. Upon testing with various values of є and MinPts, a near-optimal value of the Jaccard
coefficient is obtained at є = 0.98 and MinPts = 4 for both datasets.
8. Comparing the plots for the Cho dataset for HAC and DBSCAN, we can clearly notice that
DBSCAN finds the arbitrarily shaped clusters much better.
9. DBSCAN seems to work better for the Iyer dataset, which can be due to the smaller density
variation in the Iyer dataset.

Comparison of all the clustering algorithms:

• From the above results and plots, we can conclude that k-Means and Hadoop k-Means gave the
best results among all the clustering algorithms.
• The clustering results of HAC seem to be uniform, which can be attributed to the fact that
HAC is very sensitive to noise.
• Density-based clustering did not perform that well, which can be due to the densities not
being uniform, but it still performed better than HAC.
• Finally, we can say that performance depends on various factors like the size and type of
the datasets and the density and shape of the clusters in them.
