
Machine Learning Minors

Experiment 10

Name: Deep Prajapati

SAP ID: 60004220262

Batch: C32

Aim: Explore K-means clustering on the given datasets.

Theory:
The K-means clustering algorithm repeatedly recomputes cluster centroids until the optimal centroids are found. It assumes that the number of clusters is known in advance, which is why it is also called a flat clustering algorithm. The letter ‘K’ in K-means denotes the number of clusters the method finds in the data.

In this method, data points are assigned to clusters in such a way that the sum of the squared distances between the data points and their cluster centroid is as small as possible. It is essential to note that lower variation within a cluster means the data points in that cluster are more similar to one another.
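Formally, K-means minimizes the within-cluster sum of squared errors (SSE). Writing C_k for the set of points assigned to cluster k and \mu_k for its centroid (notation introduced here for clarity), the objective is:

SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2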

The following stages will help us understand how the K-means clustering technique works (a minimal NumPy sketch of these stages follows the list):

• Step 1: First, we need to provide the number of clusters, K, that need to be generated by this algorithm.

• Step 2: Next, choose K data points at random and assign each one to its own cluster; these K points serve as the initial centroids.

• Step 3: The cluster centroids will now be computed.

• Step 4: Iterate the steps below until we find the ideal centroids, i.e. until the assignment of data points to clusters no longer changes.

• Step 4.1: The sum of squared distances between the data points and the centroids is calculated first.

• Step 4.2: At this point, we need to allocate each data point to the cluster whose centroid is closest to it.

• Step 4.3: Finally, compute the centroids of the clusters by averaging all of the data points in each cluster.
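As a minimal sketch of steps 2–4 (the toy data and the choice of K = 2 here are illustrative only, not part of the assignment), one round of assignment and centroid update in NumPy looks like this:

import numpy as np

# Toy data: 6 two-dimensional points, K = 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
centroids = X[[0, 3]].copy()  # two random-looking points as initial centroids

for _ in range(10):
    # Steps 4.1/4.2: squared distance of every point to every centroid,
    # then assign each point to its nearest centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Step 4.3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # assignments stable: converged
        break
    centroids = new_centroids

print(labels)     # cluster index of each point
print(centroids)  # final centroid locations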

When using the K-means algorithm, we must keep the following points in mind:

• It is suggested to normalize the data when dealing with clustering algorithms such as K-means, since such algorithms employ distance-based measures to identify the similarity between data points (see the sketch after this list).

• Because of the iterative nature of K-means and the random initialization of centroids, K-means may become stuck in a local optimum and fail to converge to the global optimum. As a result, it is advised to employ several distinct centroid initializations.
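Both precautions can be sketched with scikit-learn (StandardScaler, KMeans, and the n_init parameter are scikit-learn names; the data here is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Made-up data: two features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 50.0])

# Normalize so both features contribute equally to the distance measure
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 runs K-means from 10 different random centroid
# initializations and keeps the run with the lowest SSE (inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
print("SSE of best run:", km.inertia_)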

Colab Link: https://colab.research.google.com/drive/1TLEPYAg7LruvFjTPuLLKXrW9SlKDAX-O

Lab Assignments to complete in this session:

Use the given dataset and perform the following tasks:


Dataset 1: Synthetic data (200 samples, 3 clusters, cluster_std = 2.7)

Task 1: Perform K-means clustering on Dataset 1 (random initialisation, 10 variations of the initial means, 300 iterations). Find the lowest SSE value, the final location of the centroids, and the number of iterations to converge. Show the predicted labels for the first 10 points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate dataset (200 samples, 3 clusters, cluster_std = 2.7, as specified)
X, y = make_blobs(n_samples=200, centers=3, cluster_std=2.7, n_features=2, random_state=42)

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1])
plt.title("Generated dataset")
plt.show()

# Function to calculate Euclidean distance
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

# KMeans class
class KMeans:
    def __init__(self, K=3, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps

        # List of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # Mean feature vector for each cluster
        self.centroids = []

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape

        # Initialize centroids with K distinct random samples
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]

        # Optimize clusters
        for _ in range(self.max_iters):
            # Assign samples to closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()

            # Calculate new centroids from the clusters
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)
            if self.plot_steps:
                self.plot()

            # Check if clusters have changed
            if self._is_converged(centroids_old, self.centroids):
                break

        # Classify samples as the index of their clusters
        return self._get_cluster_labels(self.clusters)

    def _get_cluster_labels(self, clusters):
        # Each sample will get the label of the cluster it was assigned to
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(clusters):
            for sample_index in cluster:
                labels[sample_index] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # Assign the samples to the closest centroids to create clusters
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            centroid_idx = self._closest_centroid(sample, centroids)
            clusters[centroid_idx].append(idx)
        return clusters

    def _closest_centroid(self, sample, centroids):
        # Distance of the current sample to each centroid
        distances = [euclidean_distance(sample, point) for point in centroids]
        # Get the closest centroid
        closest_index = np.argmin(distances)
        return closest_index

    def _get_centroids(self, clusters):
        # Assign mean value of clusters to centroids
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids

    def _is_converged(self, centroids_old, centroids):
        # Distances between each old and new centroid, for all centroids
        distances = [euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)]
        return sum(distances) == 0

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))

        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)

        for point in self.centroids:
            ax.scatter(*point, marker="x", color='black', linewidth=2)

        plt.show()

# Initialize and fit KMeans
kmeans = KMeans(K=3, max_iters=300, plot_steps=False)
y_pred = kmeans.predict(X)

# Show the predicted labels for first 10 points
print("Predicted labels:", y_pred[:10])

# Show the final location of centroids
print("Final location of centroids:", kmeans.centroids)

