
Machine Learning Minors

Experiment 10

Name: Deep Prajapati

SAP ID: 60004220262

Batch: C32

Aim: Explore K-means clustering on the given datasets.

Theory:
The K-means clustering algorithm repeatedly recomputes cluster centroids until the optimal centroids are found. It assumes that the number of clusters is known in advance, which is why it is also called a flat clustering algorithm. The letter ‘K’ in K-means denotes the number of clusters the method finds in the data.

In this method, data points are assigned to clusters in such a way that the sum of the squared distances between the data points and their cluster centroid is as small as possible. It is essential to note that lower variation within a cluster means the data points in that cluster are more similar to one another.
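Formally, K-means minimizes the within-cluster sum of squared errors (SSE). Writing C_k for the set of points assigned to cluster k and \mu_k for its centroid (notation introduced here for clarity), the objective is:

SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2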

The following stages will help us understand how the K-means clustering technique works (a minimal NumPy sketch of these stages follows the list):

• Step 1: First, we need to provide the number of clusters, K, that need to be generated by this algorithm.

• Step 2: Next, choose K data points at random and assign each one to its own cluster; these K points serve as the initial centroids.

• Step 3: The cluster centroids will now be computed.

• Step 4: Iterate the steps below until we find the ideal centroids, i.e. until the assignment of data points to clusters no longer changes.

• Step 4.1: The sum of squared distances between the data points and the centroids is calculated first.

• Step 4.2: At this point, we need to allocate each data point to the cluster whose centroid is closest to it.

• Step 4.3: Finally, compute the centroids of the clusters by averaging all of the data points in each cluster.
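As a minimal sketch of steps 2–4 (the toy data and the choice of K = 2 here are illustrative only, not part of the assignment), one round of assignment and centroid update in NumPy looks like this:

import numpy as np

# Toy data: 6 two-dimensional points, K = 2 clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])
centroids = X[[0, 3]].copy()  # two random-looking points as initial centroids

for _ in range(10):
    # Steps 4.1/4.2: squared distance of every point to every centroid,
    # then assign each point to its nearest centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Step 4.3: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # assignments stable: converged
        break
    centroids = new_centroids

print(labels)     # cluster index of each point
print(centroids)  # final centroid locations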

When using the K-means algorithm, we must keep the following points in mind:

• It is suggested to normalize the data when dealing with clustering algorithms such as K-means, since such algorithms employ distance-based measures to identify the similarity between data points (see the sketch after this list).

• Because of the iterative nature of K-means and the random initialization of centroids, K-means may become stuck in a local optimum and fail to converge to the global optimum. As a result, it is advised to employ several distinct centroid initializations.
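Both precautions can be sketched with scikit-learn (StandardScaler, KMeans, and the n_init parameter are scikit-learn names; the data here is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Made-up data: two features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 50.0])

# Normalize so both features contribute equally to the distance measure
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 runs K-means from 10 different random centroid
# initializations and keeps the run with the lowest SSE (inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
print("SSE of best run:", km.inertia_)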

Colab Link: https://colab.research.google.com/drive/1TLEPYAg7LruvFjTPuLLKXrW9SlKDAX-O

Lab Assignments to complete in this session:

Use the given dataset and perform the following tasks:


Dataset 1: Synthetic data (200 samples, 3 clusters, cluster_std = 2.7)

Task 1: Perform K-means clustering on Dataset 1 (random initialisation, 10 variations of the initial means, 300 iterations). Find the lowest SSE value, the final location of the centroids, and the number of iterations to converge. Show the predicted labels for the first 10 points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate dataset (200 samples, 3 clusters, cluster_std = 2.7, as specified)
X, y = make_blobs(n_samples=200, centers=3, cluster_std=2.7, n_features=2, random_state=42)

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1])
plt.title("Generated dataset")
plt.show()

# Function to calculate Euclidean distance
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

# KMeans class
class KMeans:
    def __init__(self, K=3, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps

        # List of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # Mean feature vector for each cluster
        self.centroids = []

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape

        # Initialize centroids with K distinct random samples
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]

        # Optimize clusters
        for _ in range(self.max_iters):
            # Assign samples to closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()

            # Calculate new centroids from the clusters
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)
            if self.plot_steps:
                self.plot()

            # Check if clusters have changed
            if self._is_converged(centroids_old, self.centroids):
                break

        # Classify samples as the index of their clusters
        return self._get_cluster_labels(self.clusters)

    def _get_cluster_labels(self, clusters):
        # Each sample will get the label of the cluster it was assigned to
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(clusters):
            for sample_index in cluster:
                labels[sample_index] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # Assign the samples to the closest centroids to create clusters
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            centroid_idx = self._closest_centroid(sample, centroids)
            clusters[centroid_idx].append(idx)
        return clusters

    def _closest_centroid(self, sample, centroids):
        # Distance of the current sample to each centroid
        distances = [euclidean_distance(sample, point) for point in centroids]
        # Get the closest centroid
        closest_index = np.argmin(distances)
        return closest_index

    def _get_centroids(self, clusters):
        # Assign mean value of clusters to centroids
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids

    def _is_converged(self, centroids_old, centroids):
        # Distances between each old and new centroid, for all centroids
        distances = [euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)]
        return sum(distances) == 0

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))

        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)

        for point in self.centroids:
            ax.scatter(*point, marker="x", color='black', linewidth=2)

        plt.show()

# Initialize and fit KMeans
kmeans = KMeans(K=3, max_iters=300, plot_steps=False)
y_pred = kmeans.predict(X)

# Show the predicted labels for first 10 points
print("Predicted labels:", y_pred[:10])

# Show the final location of centroids
print("Final location of centroids:", kmeans.centroids)

