MATH 144 - INTRODUCTION TO DATA SCIENCE

2Q SY2324
Instructor: EDGAR M. ADINA

k-Means Clustering

Clustering is a family of unsupervised learning algorithms. These are useful when we don't have any labels for the data: the algorithms try to find patterns in the internal structure or similarities of the data and put the points into different groups. Since there are no labels (true answers) associated with the data points, we cannot use that extra information to constrain the problem. Instead, there are other ways to solve it; in this section, we will take a look at a very popular clustering algorithm, k-means, and develop an understanding of how it works.

K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k groups or clusters (the usual notation), in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster.
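
Formally, given observations x_1, ..., x_n and a desired number of clusters k, k-means seeks a partition S = {S_1, ..., S_k} that minimizes the within-cluster sum of squares (the standard formulation, stated here for reference):

$$\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,$$

where $\mu_i$ is the mean of the points in $S_i$.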

The k-means method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many kinds of clustering methods, but k-means is one of the oldest and most approachable. In this lesson, we use the Python libraries NumPy and scikit-learn to implement a k-means clustering algorithm. The data we work with has three clusters, which the algorithm will try to identify. Finding an exactly optimal partition is computationally difficult (NP-hard), so practical implementations use iterative approximations. The unsupervised k-means algorithm has only a loose relationship to the k-nearest neighbors classifier, a popular supervised classification technique that is often confused with k-means because of the name.
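
Because exact optimization is NP-hard, practical implementations rely on an iterative heuristic known as Lloyd's algorithm, which alternates between assigning points to their nearest centroid and recomputing the centroids. A minimal NumPy sketch of the idea follows (illustrative only; the function name and defaults are hypothetical, and scikit-learn's KMeans adds smarter "k-means++" initialization and multiple restarts):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=None):
    """Bare-bones k-means (Lloyd's algorithm), for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes no cluster goes empty, which a robust version must handle)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids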

Sample Case - Iris Dataset

The Iris dataset (iris.csv) is one of the earliest datasets used in the classification literature and is widely used in statistics and machine learning. Each instance (row) is a plant.

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are not linearly separable from
each other.

Predicted attribute: class of iris plant.

Let us first import the needed tools.


In [9]: import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Newer Matplotlib releases renamed the bundled seaborn styles to 'seaborn-v0_8-<style>'
plt.style.use('seaborn-v0_8-poster')
%matplotlib inline

We import the data. Here we load the copy bundled with scikit-learn; if you are working with the file instead, be sure you have downloaded the dataset (iris.csv) from Blackboard and uploaded it to your Jupyter notebook.

In [3]: iris = datasets.load_iris()
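
As an aside, if you would rather read the iris.csv file from Blackboard directly, a minimal pandas sketch (assuming the file sits next to the notebook and has the usual header row; column names depend on the CSV) might look like this:

import pandas as pd

# Hypothetical alternative to datasets.load_iris()
df = pd.read_csv('iris.csv')
print(df.head())

The rest of the lesson uses the scikit-learn copy, so this step is optional.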

Let us just use two features, so that we can easily visualize them.

In [4]: X = iris.data[:, [0, 2]]
y = iris.target
target_names = iris.target_names
feature_names = iris.feature_names

Now, we extract the number of classes.

In [5]: n_class = len(set(y))

Let us visualize the data first.


In [6]: plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], color='b', marker='o', s=60)
plt.xlabel('Feature 1 - ' + feature_names[0])
plt.ylabel('Feature 2 - ' + feature_names[2])
plt.show()

Now we can use k-means by initializing the model and training the algorithm.


In [8]: import os
os.environ["OMP_NUM_THREADS"] = '1'  # avoids a known KMeans memory-leak issue on Windows with MKL
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Colors and symbols for the three clusters (defined here so the cell is self-contained)
colors = ['r', 'g', 'b']
symbols = ['o', '^', 'D']

# Instantiate the KMeans model
kmeans = KMeans(n_clusters=3, n_init=10)  # adjust the number of clusters as needed

# Fit the model to the data
kmeans.fit(X)

# Plot the k-means clusters (left) beside the true classes (right)
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

for i, (c, s) in enumerate(zip(colors, symbols)):
    ix = kmeans.labels_ == i
    ax.scatter(X[:, 0][ix], X[:, 1][ix], color=c, marker=s, s=60, label=target_names[i])
    loc = kmeans.cluster_centers_[i]
    ax.scatter(loc[0], loc[1], color='k', marker=s, linewidth=5)

    ix = y == i
    ax2.scatter(X[:, 0][ix], X[:, 1][ix], color=c, marker=s, s=60, label=target_names[i])

plt.legend(loc=4, scatterpoints=1)
ax.set_xlabel('Feature 1 - ' + feature_names[0])
ax.set_ylabel('Feature 2 - ' + feature_names[2])
ax2.set_xlabel('Feature 1 - ' + feature_names[0])
ax2.set_ylabel('Feature 2 - ' + feature_names[2])
plt.tight_layout()
plt.show()


The clusters found are saved in the labels_ attribute, and the centroids in cluster_centers_. The figure above plots the clustering results next to the real species: the left panel shows the clustering results, with the bigger symbols marking the centroids of the clusters.

We can see from the figure that the results are not too bad; they are actually quite similar to the true classes. But remember, we get these results without the labels, based only on the similarities between data points. We can also assign new data points to clusters using the predict function. The following predicts the cluster labels for two new points.

In [11]: new_points = np.array([[5, 2], [6, 5]])
kmeans.predict(new_points)

Out[11]: array([0, 1])
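
Note that the returned values are cluster indices, not species names. To see how far each new point is from every centroid, we can use the transform method, which maps samples to cluster-distance space:

# Each row gives the distances from one new point to the three centroids;
# the smallest entry in a row corresponds to the predicted label
kmeans.transform(new_points)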

SUMMARY

Machine learning refers to algorithms that have the capability to learn from data and generalize to new data.


Machine learning has two main categories: supervised learning and unsupervised learning. In supervised learning, there are classification and regression, while in unsupervised learning, there are clustering and dimensionality reduction.

The output of a classification task is categorical data.

The output of a regression task is quantitative data.

Reflections

1. Discuss the significance of choosing the appropriate number of clusters and the impact it
had on the results.

Answer: Selecting the appropriate cluster number is critical in cluster analysis, shaping the interpretability and accuracy of outcomes. Too few clusters may oversimplify the model, missing important data nuances, while too many can introduce noise, leading to inaccurate conclusions. Consequently, the cluster count significantly affects the practical utility of the analysis, influencing decision-making processes. Thus, determining the optimal number of clusters is essential for deriving meaningful insights and ensuring the reliability of the results.
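
One common heuristic for choosing k is the "elbow method": fit k-means for a range of k values, plot the inertia (the within-cluster sum of squares, exposed by scikit-learn as the inertia_ attribute), and look for the point where further increases in k stop paying off. A brief sketch, reusing the X defined earlier:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, 'o-')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()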

2. Share your insights into how K-means clustering helped in uncovering patterns or
relationships within the data. How might the choice of features influence the clustering
outcome?

Answer: K-means clustering reveals data patterns by grouping similar points iteratively.
However, the clustering outcome heavily relies on feature selection. Relevant features improve
accuracy by capturing meaningful distinctions, yielding coherent clusters. Conversely, irrelevant
or redundant features introduce noise, undermining clustering. Thus, choosing informative
features that represent data effectively is vital for maximizing K-means clustering's
effectiveness and extracting valuable insights.
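
Related to feature choice: because k-means relies on Euclidean distances, features measured on larger scales dominate the result, so standardizing features first is a common safeguard. A minimal sketch with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Put all features on a comparable scale before clustering
X_scaled = StandardScaler().fit_transform(X)
kmeans_scaled = KMeans(n_clusters=3, n_init=10).fit(X_scaled)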

3. Elaborate on any practical applications or decision-making scenarios where K-means clustering can be effectively employed.

Answer: K-means clustering finds utility in diverse domains, facilitating decision-making processes. It assists businesses in customer segmentation, enabling targeted marketing approaches based on purchasing habits. In healthcare, it aids in grouping patients for personalized treatment strategies. Financial sectors benefit from portfolio diversification through clustering assets with comparable risk-return profiles. Furthermore, in image processing, K-means segmentation delineates meaningful image regions, as sketched below. Overall, K-means clustering proves versatile across practical applications, offering valuable insights for informed decision-making.
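
For instance, the image-segmentation use mentioned above reduces to color quantization: treat each pixel as a point in RGB space and replace it with its cluster centroid. A hypothetical sketch (img is assumed to be a (height, width, 3) RGB array, e.g. from plt.imread; it is not defined in this lesson):

import numpy as np
from sklearn.cluster import KMeans

# Flatten the hypothetical image into one RGB point per pixel
pixels = img.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=4, n_init=10).fit(pixels)
# Replace every pixel with its centroid color to get a 4-region segmentation
segmented = km.cluster_centers_[km.labels_].reshape(img.shape)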
