
Department of Electronics & Telecommunications Engineering
ETEL71A-Machine Learning and AI
Class: BE
Name: Adya Kastwar
UID : 2016120024
Sem: VII
Experiment: Unsupervised learning algorithms

Objective: Apply EM algorithm to cluster a set of data stored in a .CSV file. Use the same data set
for clustering using k-Means algorithm. Compare the results of these two algorithms
and comment on the quality of clustering. You can make use of Python ML library
classes/API in the program.
Outcomes:
1. Understand unsupervised, semi-supervised learning and the methods of clustering.
2. Understand Expectation Maximization algorithms to maximize the likelihoods.
3. Apply K-means algorithm in Python to form clusters of unlabelled data.
4. Apply a hierarchical algorithm to form clusters and plot a dendrogram.
5. Compare the unsupervised learning algorithms.

System Requirements:
Linux OS with Python and its libraries, or R, or Windows with MATLAB

Task 1: Describe the clustering algorithms:


a. There are four main methods of clustering in unsupervised learning: Exclusive (K-means),
Overlapping (Fuzzy K-means), Hierarchical (Agglomerative, Divisive) and Probabilistic
(Expectation Maximization).
1) K-means
• The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
• It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible.
• It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.
• The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
2) Fuzzy K-means
• Unlike K-means, which seeks hard clusters in which each point belongs to exactly one cluster, Fuzzy K-means seeks softer, overlapping clusters.
• A single point in a soft cluster can belong to more than one cluster, with a certain affinity value towards each of those clusters.
• The affinity decreases with the distance of the point from the cluster centroid.
• Like K-means, Fuzzy K-means works on objects that have a distance measure defined and can be represented in an n-dimensional vector space (a NumPy sketch of the fuzzy update rules is given after this list).
3) Agglomerative hierarchical clustering: Initially, each data point is considered an individual cluster. At each iteration, the most similar clusters are merged, until one cluster (or K clusters) remains.
4) Divisive hierarchical clustering: We consider all the data points as a single cluster and, in each iteration, separate out the data points that are least similar to the rest. Each data point that is separated is considered an individual cluster, and in the end we are left with n clusters.
5) Expectation Maximization:
• EM is an iterative process that begins with a "naive" or random initialization and then alternates between the expectation and maximization steps until the algorithm converges.
• We try to find a number of Gaussian distributions that can be used to describe the shape of our dataset.
• A critical point for understanding is that these Gaussian-shaped clusters need not be circular, as they effectively are in K-means, but can take any shape a multivariate Gaussian distribution can take.
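The objective also calls for EM-based clustering, while the Task 2 code below covers only K-means and agglomerative clustering. A minimal sketch of EM clustering on the same two iris features, using scikit-learn's GaussianMixture (which is fitted with the EM algorithm), might look like this; the header=None argument and the choice of three components are assumptions matching the rest of this report.

#expectation maximization via a Gaussian mixture (illustrative sketch)
import pandas as pd
from sklearn.mixture import GaussianMixture

dataset = pd.read_csv('iris.data', header=None)   # assumes the headerless UCI iris.data file
X = dataset.iloc[:, [2, 3]].values                # petal length and petal width

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
y_gmm = gmm.fit_predict(X)                        # E-step and M-step alternate until convergence

print(gmm.means_)     # mean vector of each Gaussian component
print(gmm.n_iter_)    # number of EM iterations used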
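scikit-learn has no Fuzzy K-means implementation, so the sketch below shows the two fuzzy c-means update rules in plain NumPy; the random data, the fuzzifier value m = 2 and all variable names are illustrative assumptions, not part of the original experiment.

#fuzzy k-means (fuzzy c-means) update rules, illustrative sketch
import numpy as np

np.random.seed(0)
X = np.random.rand(150, 2)               # stand-in for two numeric features
c, m = 3, 2.0                            # number of clusters and fuzzifier m > 1

U = np.random.rand(len(X), c)            # random soft memberships...
U /= U.sum(axis=1, keepdims=True)        # ...normalized so each row sums to 1

for _ in range(100):
    W = U ** m
    centroids = (W.T @ X) / W.sum(axis=0)[:, None]                            # weighted means
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-10
    U_new = 1.0 / dist ** (2 / (m - 1))                                       # affinity falls off with distance
    U_new /= U_new.sum(axis=1, keepdims=True)
    if np.allclose(U, U_new, atol=1e-6):
        break
    U = U_new

labels = U.argmax(axis=1)                # hard labels, e.g. for plotting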
b. Note down the steps of K-means algorithm and Hierarchical agglomerative algorithm for
clustering. State the methods to find similarities between the clusters.
K means:
1. Specify number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. until the assignment of data points to clusters stops changing:
• Compute the sum of the squared distances between data points and all centroids.
• Assign each data point to the closest cluster (centroid).
• Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.
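A rough NumPy sketch of the steps above, given here only for illustration (the experiment itself uses scikit-learn's KMeans); it assumes no cluster ever becomes empty during the iterations.

#k-means steps in plain NumPy, illustrative sketch
import numpy as np

def kmeans_numpy(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # step 2: randomly pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # compute squared distances of every point to every centroid
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # assign each point to its closest centroid
        labels = dist.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 3: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids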
Hierarchical agglomerative:
• Compute the proximity matrix.
• Let each data point be a cluster.
• Repeat: merge the two closest clusters and update the proximity matrix.
• Until only a single cluster remains.
The key operation is the computation of the proximity (similarity) of two clusters. Common linkage criteria are single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), average linkage, and Ward's method (minimum increase in within-cluster variance).
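As a quick illustration of these linkage criteria, SciPy's linkage function can be run with each method on a small toy array; the data here is random and purely illustrative, and 'ward' is the criterion used in the Task 2 code below.

#comparing linkage (cluster-similarity) criteria, illustrative sketch
import numpy as np
import scipy.cluster.hierarchy as sch

X_toy = np.random.rand(20, 2)                    # small random data, for illustration only
for method in ['single', 'complete', 'average', 'ward']:
    Z = sch.linkage(X_toy, method=method)        # (n-1) x 4 merge history
    print(method, Z[:3])                         # first few merges under each criterion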

Task 2: Write a python code to implement K-means and Agglomerative algorithm to form clusters for
the ‘Iris’ flower data. Assume the value of ‘K’ from the number of classes and remove the
class column to use the data for unsupervised learning.

#kmeans
#adya
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#dataset and features


dataset = pd.read_csv('iris.data', header=None)   # the raw UCI iris.data file has no header row
X = dataset.iloc[:, [2, 3]].values                # petal length and petal width columns

#kmeans and finding value of k


from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)


plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

#fitting kmeans model


kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)

#plot clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of flower species')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend()
plt.show()
#hierarchical agglomerative
#Adya Kastwar
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#reading dataset, features


dataset = pd.read_csv('iris.data', header=None)   # again assuming the headerless UCI file
X = dataset.iloc[:, [2, 3]].values

#dendrogram
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('flower')
plt.ylabel('Euclidean distances')
plt.show()

#fit model
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(X)

#plot clusters
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
#plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
#plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.title('Clusters of flowers')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend()
plt.show()
Task 3: Plot a dendrogram for agglomerative clustering, plot the clusters formed and compare the iterations required by both models.
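One way to make this comparison, assuming the kmeans object and feature matrix X from the Task 2 code are still in scope: scikit-learn's KMeans records the number of Lloyd iterations in n_iter_, whereas AgglomerativeClustering has no iteration counter because it always performs exactly n-1 merges.

print('k-means iterations until convergence:', kmeans.n_iter_)   # attribute set by fit()/fit_predict()
print('agglomerative merge steps (always n - 1):', len(X) - 1)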
Task 4: State the pros and cons of both the algorithms.
Kmeans
Pros:
1) Fast, robust and easy to understand.
2) Relatively efficient: O(tknd), where n is the number of objects, k the number of clusters, d the dimension of each object, and t the number of iterations. Normally k, t, d << n.
3) Gives the best results when the clusters in the data set are distinct or well separated from each other.
Cons:
1) The learning algorithm requires a priori specification of the number of cluster centers.
2) Exclusive assignment: if two clusters overlap heavily, k-means is unable to resolve that there are two clusters.
3) Euclidean distance measures can unequally weight underlying factors.
4) Unable to handle noisy data and outliers.
Agglomerative hierarchical
Pros:
• Flexible in the choice of distance measure and linkage criterion.
• Useful for smaller datasets; the dendrogram shows the full merge history.
Cons:
• Not very scalable: the proximity matrix grows as O(n²).
• Impractical for large datasets.

Dataset: Iris flower data set with 'petal length' and 'petal width' attributes to limit it to 2 dimensions.

Conclusion:
Both algorithms group the data by similarity, and on this dataset they produce very similar clusterings.
K-means requires prior specification of k; the elbow method was used to find an approximate value of k.
Agglomerative clustering does not need k up front: the dendrogram can be cut at the desired level to obtain the clusters.
