Clustering Algorithms
Team Members
Chaitanya Vedurupaka (50205782)
Anirudh Yellapragada (50206970)
k-Means:
k-means is an unsupervised learning technique used to solve the clustering problem.
Below are the implementation details of the algorithm:
1. Initially, k centroids are chosen randomly from the given input data. For the purpose of this project demo, we also provide an option to take the initial centroids from the user.
2. For each input data point, we measure the distance between the data point and each of the k chosen centroids. The index of the centroid at minimum distance from the data point is assigned as that data point's cluster index.
3. Once all the data points have been assigned a cluster index, new centroids are calculated as the mean of all data points sharing the same cluster index.
4. This process continues until the centroids converge, i.e., until the difference between the old and new centroids becomes zero (see the sketch after this list).
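A minimal plain-Java sketch of this loop, assuming the data is held as a dense double[][] array; the names KMeansSketch and nearest are illustrative, not the project's actual code:

    import java.util.Arrays;

    public class KMeansSketch {
        // Steps 2-4: assign each point to the nearest centroid, then recompute
        // the centroids as cluster means, until the centroids stop moving.
        static int[] cluster(double[][] points, double[][] centroids) {
            int[] labels = new int[points.length];
            boolean changed = true;
            while (changed) {
                // Step 2: assign every point to its nearest centroid.
                for (int i = 0; i < points.length; i++)
                    labels[i] = nearest(points[i], centroids);
                // Step 3: recompute each centroid as the mean of its points.
                double[][] updated = new double[centroids.length][points[0].length];
                int[] counts = new int[centroids.length];
                for (int i = 0; i < points.length; i++) {
                    counts[labels[i]]++;
                    for (int d = 0; d < points[i].length; d++)
                        updated[labels[i]][d] += points[i][d];
                }
                for (int k = 0; k < updated.length; k++) {
                    if (counts[k] == 0) { updated[k] = centroids[k].clone(); continue; }
                    for (int d = 0; d < updated[k].length; d++)
                        updated[k][d] /= counts[k];
                }
                // Step 4: stop when the old and new centroids coincide.
                changed = !Arrays.deepEquals(centroids, updated);
                centroids = updated;
            }
            return labels;
        }

        // Index of the centroid closest (in Euclidean distance) to point p.
        static int nearest(double[] p, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int k = 0; k < centroids.length; k++) {
                double dist = 0;
                for (int d = 0; d < p.length; d++) {
                    double diff = p[d] - centroids[k][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = k; }
            }
            return best;
        }
    }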
After the complete run of the algorithm on the input data, we compare the ground truth clusters and the algorithmic clusters using the Jaccard coefficient.
The Hadoop framework splits the given input dataset into a number of blocks, each of which is processed by a mapper.
1) Map: The mapper task takes the input data and emits intermediate <key, value> pairs, where the key and value depend on the algorithm being implemented. The Hadoop MapReduce framework creates one map task for each of the input blocks.
The output from the mappers then goes through the framework's combine, partition, sort, and shuffle phases, and the intermediate results are passed as input to the reducer.
2) Reduce: The task of the reducer is to collect the data points sharing the same key and convert them into a smaller set of data points, in a manner that depends on the algorithm being implemented.
In our implementation of the k-means algorithm, the tasks performed by the Hadoop MapReduce framework at the different phases are described below.
We have used Java for Hadoop MapReduce. We maintain a global map with the old centroid as the key and the new centroid as the value.
1) Driver:
The global map is initialized with the initial centroids from the "centroids.txt" file.
A job configuration is created and submitted by the job client to the resource manager of the Hadoop framework, which distributes it to the different slave nodes of the cluster. The job configuration includes details of the mapper, combiner, and reducer classes.
Once the mapper and reducer jobs are done, we check whether the centroids have converged. Once the centroids converge, the MapReduce task is done, and the output from the reducer is used to calculate the Jaccard coefficient.
For visualizing the data, we write the cluster index of each data point to a file that is used by a Python program for plotting.
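A condensed sketch of this driver logic, using the standard Hadoop MapReduce Java API; KMeansMapper, KMeansCombiner, and KMeansReducer stand in for the project's actual classes, and centroidsConverged is a hypothetical placeholder for the convergence check:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KMeansDriver {
        public static void main(String[] args) throws Exception {
            // Load the initial centroids from "centroids.txt" into the global map (omitted).
            boolean converged = false;
            int iteration = 0;
            while (!converged) {
                Job job = Job.getInstance(new Configuration(), "k-means iteration " + iteration);
                job.setJarByClass(KMeansDriver.class);
                job.setMapperClass(KMeansMapper.class);     // emits <nearest centroid, point>
                job.setCombinerClass(KMeansCombiner.class); // emits partial centroid averages
                job.setReducerClass(KMeansReducer.class);   // computes the new centroids
                job.setOutputKeyClass(IntWritable.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                // A fresh output directory per iteration, since Hadoop refuses to overwrite.
                FileOutputFormat.setOutputPath(job, new Path(args[1] + "/iter" + iteration++));
                job.waitForCompletion(true);
                converged = centroidsConverged();
            }
            // The final reducer output feeds the Jaccard computation and the plotting script.
        }

        // Hypothetical placeholder: would compare the old and new centroids held
        // in the global map and report whether they coincide.
        static boolean centroidsConverged() {
            return true;
        }
    }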
2) Mapper:
The input to the mapper is a single data point from the input data file.
We use the new centroids that are stored as values in the global map.
The distance between the current data point and each of the centroids is measured.
The output of the mapper is the centroid at minimum distance from the data point, together with the data point itself.
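A sketch of such a mapper, assuming a tab-separated record layout of gene ID, ground truth value, and attributes; the static centroids field stands in for the project's global map, and the centroid index serves as the emitted key:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        // Stand-in for the project's global old-centroid -> new-centroid map;
        // here just the current centroid vectors, set up by the driver.
        static double[][] centroids;

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            double[] point = parseAttributes(line.toString());
            int nearest = 0;
            double best = Double.MAX_VALUE;
            for (int k = 0; k < centroids.length; k++) {
                double dist = 0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids[k][d];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; nearest = k; }
            }
            // Emit <index of the nearest centroid, the data point itself>.
            context.write(new IntWritable(nearest), line);
        }

        // Assumed record layout: geneId \t groundTruth \t attr1 \t attr2 ...
        private static double[] parseAttributes(String record) {
            String[] f = record.split("\t");
            double[] attrs = new double[f.length - 2];
            for (int d = 0; d < attrs.length; d++)
                attrs[d] = Double.parseDouble(f[d + 2]);
            return attrs;
        }
    }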
3) Combiner:
The input to the combiner is a key/value pair: the centroid is the key, and the value contains the list of all data points belonging to that centroid.
We combine all the data points that share the same centroid, constructing a string of comma-separated gene IDs, followed by comma-separated ground truth values, followed by the average attribute values of all the data points belonging to the same centroid.
The output of the combiner has the same form as the output of the mapper, except that we send partially calculated centroid values, thereby reducing the work of the reducer and improving the running time.
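A sketch of this combiner under the same assumed tab-separated record layout; the output format (semicolon-separated fields plus a count) is an illustrative choice, not necessarily the project's:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable centroid, Iterable<Text> points, Context context)
                throws IOException, InterruptedException {
            StringBuilder ids = new StringBuilder();
            StringBuilder truths = new StringBuilder();
            double[] sum = null;
            int count = 0;
            for (Text t : points) {
                // Assumed record layout: geneId \t groundTruth \t attr1 \t attr2 ...
                String[] f = t.toString().split("\t");
                if (ids.length() > 0) { ids.append(','); truths.append(','); }
                ids.append(f[0]);
                truths.append(f[1]);
                if (sum == null) sum = new double[f.length - 2];
                for (int d = 0; d < sum.length; d++)
                    sum[d] += Double.parseDouble(f[d + 2]);
                count++;
            }
            StringBuilder avg = new StringBuilder();
            for (int d = 0; d < sum.length; d++) {
                if (d > 0) avg.append(',');
                avg.append(sum[d] / count);
            }
            // Partial centroid plus the count, so the reducer can finish the average.
            context.write(centroid, new Text(ids + ";" + truths + ";" + avg + ";" + count));
        }
    }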
4) Reducer:
The input to the reducer is a key/value pair: the centroid is the key, and the value contains the list of all data points belonging to that centroid. The value may also be the output of the combiner, in which case we parse it and calculate the new centroid from the partial centroid values passed by the combiner.
The new centroid is calculated from all the data points received, and the global map is updated accordingly.
The new centroid values are written to the output file along with all gene IDs associated with the new centroids.
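A matching reducer sketch, assuming every value has passed through the combiner format from the previous sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable centroid, Iterable<Text> partials, Context context)
                throws IOException, InterruptedException {
            StringBuilder ids = new StringBuilder();
            double[] weightedSum = null;
            int total = 0;
            for (Text t : partials) {
                // Assumed combiner format: ids;truths;avg1,avg2,...;count
                String[] parts = t.toString().split(";");
                int count = Integer.parseInt(parts[3]);
                String[] avg = parts[2].split(",");
                if (weightedSum == null) weightedSum = new double[avg.length];
                for (int d = 0; d < avg.length; d++)
                    weightedSum[d] += Double.parseDouble(avg[d]) * count;
                if (ids.length() > 0) ids.append(',');
                ids.append(parts[0]);
                total += count;
            }
            StringBuilder newCentroid = new StringBuilder();
            for (int d = 0; d < weightedSum.length; d++) {
                if (d > 0) newCentroid.append(',');
                newCentroid.append(weightedSum[d] / total);
            }
            // Update the global old-centroid -> new-centroid map here (omitted),
            // then emit the new centroid with all gene IDs assigned to it.
            context.write(centroid, new Text(newCentroid + "\t" + ids));
        }
    }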
Hierarchical Agglomerative Clustering:
1. Compute the distance matrix by calculating the Euclidean distance between all pairs of data points.
2. Initially, each data point is a separate cluster.
3. Store all the clusters.
4. Using the distance matrix computed in step 1, find the two closest clusters under single link, i.e., the closest pair of data objects belonging to different clusters.
5. Merge the two closest clusters found in step 4.
6. Update the distance matrix so that the cluster merged in step 5 has its distances to the other clusters set to the minimum over the two merged clusters. Update the distances in the non-merged clusters as well, based on the new merged cluster.
7. Store the updated clusters.
8. Repeat from step 4 until a single cluster remains (a plain-Java sketch of this loop appears at the end of this section).
9. Build the ground truth matrix using the true labels.
10. Retrieve the stored cluster data for the desired number of clusters (e.g., if you assume there are 10 clusters, retrieve the stored cluster data that has 10 clusters).
11. Build the cluster truth matrix using the cluster data obtained in step 10.
12. Compute the Jaccard coefficient on the matrices obtained in steps 9 and 11 using the formula below:
Jaccard coefficient = |M11| / (|M11| + |M10| + |M01|)
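A small sketch of this computation from two label arrays, where M11 counts pairs of points grouped together in both labelings, and M10 and M01 count pairs grouped together only in the ground truth, respectively only in the clustering:

    public class JaccardSketch {
        // truth[i] and predicted[i] are the ground truth and algorithmic
        // cluster labels of data point i.
        static double jaccard(int[] truth, int[] predicted) {
            long m11 = 0, m10 = 0, m01 = 0;
            for (int i = 0; i < truth.length; i++) {
                for (int j = i + 1; j < truth.length; j++) {
                    boolean sameTruth = truth[i] == truth[j];
                    boolean samePred = predicted[i] == predicted[j];
                    if (sameTruth && samePred) m11++;      // together in both
                    else if (sameTruth) m10++;             // only in the ground truth
                    else if (samePred) m01++;              // only in the clustering
                }
            }
            return (double) m11 / (m11 + m10 + m01);
        }
    }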
Visualization for 5 clusters with the ground truth labels:
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the Cho dataset is 0.2283949.
Visualization for 10 clusters with the ground truth labels:
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the Iyer dataset is 0.15486.
Pros of Hierarchical Agglomerative Clustering:
Any desired number of clusters can be obtained by cutting the dendrogram at the
proper level.
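For reference, a plain-Java sketch of the single-link merge loop (steps 1-8 above), assuming the full distance matrix fits in memory; class and helper names are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    public class SingleLinkHAC {
        // dist: full Euclidean distance matrix (step 1).
        // Performs steps 2-8 and prints the clustering after each merge.
        static void cluster(double[][] dist) {
            int n = dist.length;
            double[][] d = new double[n][n];
            for (int i = 0; i < n; i++) d[i] = dist[i].clone();
            boolean[] active = new boolean[n];
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < n; i++) {              // step 2: singleton clusters
                active[i] = true;
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }
            for (int merges = 0; merges < n - 1; merges++) {
                // Step 4: closest pair of active clusters (single link).
                int a = -1, b = -1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < n; i++) {
                    if (!active[i]) continue;
                    for (int j = i + 1; j < n; j++)
                        if (active[j] && d[i][j] < best) { best = d[i][j]; a = i; b = j; }
                }
                // Step 5: merge b into a; step 6: single-link distance update.
                clusters.get(a).addAll(clusters.get(b));
                active[b] = false;
                for (int k = 0; k < n; k++)
                    if (active[k] && k != a)
                        d[a][k] = d[k][a] = Math.min(d[a][k], d[b][k]);
                // Step 7: record the current clustering.
                System.out.println((n - merges - 1) + " clusters: " + snapshot(clusters, active));
            }
        }

        static List<List<Integer>> snapshot(List<List<Integer>> clusters, boolean[] active) {
            List<List<Integer>> out = new ArrayList<>();
            for (int i = 0; i < clusters.size(); i++)
                if (active[i]) out.add(new ArrayList<>(clusters.get(i)));
            return out;
        }
    }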
DBSCAN:
1. Compute the distance matrix by calculating the Euclidean distance between all pairs of data points.
2. Initialize clusterSize to 0.
3. For each data point (sketched after this list):
a. Check whether it has been visited.
b. If visited, continue with the next point (go to step 3).
c. Else, mark it as visited and compute its neighborhood points within a radius of ε.
i. If the number of neighborhood points is less than MinPts, mark the point as noise and go to step 3.
ii. Else, increment clusterSize by 1, add the current point to the new cluster, and add all the neighborhood points to a queue.
iii. Pop each point from the queue and check whether it has been visited. If not visited, mark it as visited and compute its neighborhood points within a radius of ε. If the number of points is greater than or equal to MinPts, add all the neighborhood points to the queue.
iv. If the popped point has not been added to any cluster, add it to the current cluster.
v. Repeat steps iii and iv until the queue is empty.
4. Build the ground truth matrix using the true labels.
5. Build the cluster truth matrix using the cluster data obtained in step 3.
6. Compute the Jaccard coefficient on the matrices obtained in steps 4 and 5 using the formula below:
Jaccard coefficient = |M11| / (|M11| + |M10| + |M01|)
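A plain-Java sketch of the clustering loop (steps 1-3 above), assuming a precomputed distance matrix; the region helper and the label conventions are illustrative:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class DBSCANSketch {
        static final int NOISE = -1; // labels[i] == 0 means unvisited/unassigned

        // dist: precomputed Euclidean distance matrix (step 1).
        static int[] cluster(double[][] dist, double eps, int minPts) {
            int n = dist.length;
            int[] labels = new int[n];
            boolean[] visited = new boolean[n];
            int clusterId = 0;                   // step 2: cluster count starts at 0
            for (int p = 0; p < n; p++) {        // step 3
                if (visited[p]) continue;        // 3a/3b
                visited[p] = true;               // 3c
                int[] neighbors = region(dist, p, eps);
                if (neighbors.length < minPts) { // 3c-i: too sparse, mark as noise
                    labels[p] = NOISE;
                    continue;
                }
                clusterId++;                     // 3c-ii: start a new cluster
                labels[p] = clusterId;
                Deque<Integer> queue = new ArrayDeque<>();
                for (int q : neighbors) queue.add(q);
                while (!queue.isEmpty()) {       // 3c-iii..v: expand the cluster
                    int q = queue.poll();
                    if (!visited[q]) {
                        visited[q] = true;
                        int[] qNeighbors = region(dist, q, eps);
                        if (qNeighbors.length >= minPts)
                            for (int r : qNeighbors) queue.add(r);
                    }
                    // Claim points not yet in any cluster (including earlier noise).
                    if (labels[q] <= 0) labels[q] = clusterId;
                }
            }
            return labels;
        }

        // All points within radius eps of point p (p's ε-neighborhood).
        static int[] region(double[][] dist, int p, double eps) {
            return java.util.stream.IntStream.range(0, dist.length)
                    .filter(q -> q != p && dist[p][q] <= eps).toArray();
        }
    }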
Visualization for the Cho dataset with ε = 0.98 and MinPts = 4 for the ground truth labels:
Visualization for the Cho dataset with ε = 0.98 and MinPts = 4 for the cluster truth labels:
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the Cho dataset is 0.2088777.
Visualization for the Iyer dataset with ε = 0.98 and MinPts = 4 for the ground truth labels:
Visualization for the Iyer dataset with ε = 0.98 and MinPts = 4 for the cluster truth labels:
We have used the Jaccard coefficient as the external index. The Jaccard coefficient for the Iyer dataset is 0.2826085.
Pros of DBSCAN Clustering:
1. Unlike k-means, DBSCAN does not require the number of clusters in the data to be specified a priori.
2. DBSCAN can find arbitrarily shaped clusters.
3. It is resistant to noise.
Cons of DBSCAN Clustering:
1. It cannot handle varying densities.
2. It is sensitive to its parameters, i.e., the correct parameter values are hard to determine.