
B.Tech (R&A)
Semester: 7 Subject: Machine Learning
Name: Pratik Mane Class: R&A Final Year
Roll No: PA66 Batch:

Experiment No: 06

Name of the Experiment: K Means Clustering

Performed on: Marks Teacher’s Signature with date

Submitted on:

Aim: To form clusters of given data using k-means clustering

Pre-requisite: concept of similarity measure, unsupervised learning

Objective:
1. To solve clustering example by using k-means algorithm
2. To implement the same using python

Components and equipment required:


PC with Windows 7 or later and the Anaconda distribution with Python 3.7

www.mitwpu.edu.in
Expt. 6
(The following part is to be solved in the Colab notebook)

Conclusion:

Post-lab questions:
1. State applications of k-means clustering algorithm.

Additional Links for reference:

https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf

11/6/22, 1:40 PM PA66 EXP6 ML.ipynb - Colaboratory

Name: Pratik Mane


Roll Number: PA-66
Batch: B-23
PRN Number: 1032201624
Machine Learning Experiment -6: K-Means Clustering Algorithm

from sklearn.cluster import KMeans
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

# Import the dataset and add column names to the imported dataset


df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Experiment 6 K means/K-Means_data')

df.head()

      Name  Age  Income ($)
0      Rob   27       70000
1  Michael   29       90000
2    Mohan   29       61000
3   Ismail   28       60000
4     Kory   42      150000

plt.scatter(df['Age'], df['Income ($)'])

<matplotlib.collections.PathCollection at 0x7f899eebc110>

# Choosing value of k
km = KMeans(n_clusters=3)
km

https://colab.research.google.com/drive/1holqy_3nnMUE2ghbRw32OHUPZxL4MAsb#scrollTo=lB169pK9cYmI&printMode=true 1/7

KMeans(n_clusters=3)

# The Name column is excluded since it is a string and will not be used in numeric computation
# Fit and predict
y_predicted = km.fit_predict(df[['Age','Income ($)']])
y_predicted

array([1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2],
dtype=int32)
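fit_predict returns one integer cluster label per row. As a small illustrative sketch (reusing the label array shown above, not re-running the model), NumPy's bincount counts how many points land in each cluster:

```python
import numpy as np

# Labels as returned by fit_predict: one integer cluster id per row.
labels = np.array([1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0,
                   2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2])

# np.bincount gives the number of points assigned to each cluster id.
sizes = np.bincount(labels)
print(sizes)  # member counts for clusters 0, 1, 2
```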

df['Cluster'] = y_predicted
df.head()

      Name  Age  Income ($)  Cluster
0      Rob   27       70000        1
1  Michael   29       90000        1
2    Mohan   29       61000        2
3   Ismail   28       60000        2
4     Kory   42      150000        0

km.cluster_centers_

array([[3.82857143e+01, 1.50000000e+05],
[3.40000000e+01, 8.05000000e+04],
[3.29090909e+01, 5.61363636e+04]])
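At convergence, each entry of cluster_centers_ is simply the mean of the points assigned to that cluster. A minimal sketch on toy 2-D data (purely illustrative, not the experiment's file) verifies this property:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
              [8.0, 8.0], [8.2, 7.8], [7.8, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Each centroid equals the mean of the points assigned to it.
for k in range(2):
    assert np.allclose(km.cluster_centers_[k], X[labels == k].mean(axis=0))
```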

# Divide the dataset into three parts according to the 3 clusters


df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]

plt.scatter(df1.Age, df1['Income ($)'], color='green', label='Green Cluster')


plt.scatter(df2.Age, df2['Income ($)'], color='red', label='Red Cluster')
plt.scatter(df3.Age, df3['Income ($)'], color='black', label='Black Cluster')

plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
# From the graph we can see the two features are on very different scales (income in tens of thousands vs age in tens),
# so we need to use Min-Max scaling before applying the k-means algorithm properly.


<matplotlib.legend.Legend at 0x7f899eeafc50>

# Divide the dataset into three parts according to the 3 clusters

df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]

plt.scatter(df1.Age, df1['Income ($)'], color='green', label='Green Cluster')


plt.scatter(df2.Age, df2['Income ($)'], color='red', label='Red Cluster')
plt.scatter(df3.Age, df3['Income ($)'], color='black', label='Black Cluster')

plt.xlabel('Age')
plt.ylabel('Income ($)')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='purple', marker='*', label='Centroid')


plt.legend()
# From the graph we can see the two features are on very different scales (income in tens of thousands vs age in tens),
# so we need to use Min-Max scaling before applying the k-means algorithm properly.

<matplotlib.legend.Legend at 0x7f899f4a7dd0>

# Scale the Age and Income features to the range 0 to 1


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['Income ($)']])
df[['Income ($)']] = scaler.transform(df[['Income ($)']])

scaler.fit(df[['Age']])
df[['Age']] = scaler.transform(df[['Age']])
df


        Name       Age  Income ($)  Cluster
0        Rob  0.058824    0.213675        1
1    Michael  0.176471    0.384615        1
2      Mohan  0.176471    0.136752        2
3     Ismail  0.117647    0.128205        2
4       Kory  0.941176    0.897436        0
5     Gautam  0.764706    0.940171        0
6      David  0.882353    0.982906        0
7     Andrea  0.705882    1.000000        0
8       Brad  0.588235    0.948718        0
9   Angelina  0.529412    0.726496        0
10    Donald  0.647059    0.786325        0
11       Tom  0.000000    0.000000        2
12    Arnold  0.058824    0.025641        2
13     Jared  0.117647    0.051282        2
14     Stark  0.176471    0.038462        2
15    Ranbir  0.352941    0.068376        2
16     Dipak  0.823529    0.170940        2
17  Priyanka  0.882353    0.153846        2
18      Nick  1.000000    0.162393        2
19      Alia  0.764706    0.299145        1
20       Sid  0.882353    0.316239        1
21     Abdul  0.764706    0.111111        2

# Use the k-means algorithm to train on the scaled Age and Income ($) features
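Min-Max scaling maps each feature to [0, 1] via (x - min) / (max - min). The sketch below (with hypothetical income values, not the experiment's file) checks this formula against scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical income column for illustration.
income = np.array([[60000.0], [70000.0], [90000.0], [150000.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(income)

# Manual formula: (x - min) / (max - min)
manual = (income - income.min()) / (income.max() - income.min())
assert np.allclose(scaled, manual)
print(scaled.ravel())  # minimum maps to 0.0, maximum to 1.0
```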
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income ($)']])
y_predicted

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
dtype=int32)

df['Cluster'] = y_predicted
# Note: drop() returns a copy; without reassignment or inplace=True the Cluster column remains in df
df.drop('Cluster', axis='columns')
df


        Name       Age  Income ($)  Cluster
0        Rob  0.058824    0.213675        1
1    Michael  0.176471    0.384615        1
2      Mohan  0.176471    0.136752        1
3     Ismail  0.117647    0.128205        1
4       Kory  0.941176    0.897436        2
5     Gautam  0.764706    0.940171        2
6      David  0.882353    0.982906        2
7     Andrea  0.705882    1.000000        2
8       Brad  0.588235    0.948718        2
9   Angelina  0.529412    0.726496        2
10    Donald  0.647059    0.786325        2
11       Tom  0.000000    0.000000        1
12    Arnold  0.058824    0.025641        1
13     Jared  0.117647    0.051282        1
14     Stark  0.176471    0.038462        1
15    Ranbir  0.352941    0.068376        1
16     Dipak  0.823529    0.170940        0
17  Priyanka  0.882353    0.153846        0
18      Nick  1.000000    0.162393        0
19      Alia  0.764706    0.299145        0
20       Sid  0.882353    0.316239        0
21     Abdul  0.764706    0.111111        0

km.cluster_centers_

array([[0.85294118, 0.2022792 ],
       [0.1372549 , 0.11633428],
       [0.72268908, 0.8974359 ]])

df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]

plt.scatter(df1.Age, df1['Income ($)'], color='green', label='Green Cluster')


plt.scatter(df2.Age, df2['Income ($)'], color='red', label='Red Cluster')
plt.scatter(df3.Age, df3['Income ($)'], color='black', label='Black Cluster')

plt.xlabel('Age')
plt.ylabel('Income ($)')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='purple', marker='*', label='Centroid')


plt.legend()
# After Min-Max scaling both features lie in [0, 1], so the three clusters are well separated
# and the centroids (stars) now sit at the centre of each cluster.

<matplotlib.legend.Legend at 0x7f899f176c90>

k_rng = range(1,10)
sse = []
for k in k_rng:
km = KMeans(n_clusters=k)
km.fit(df[['Age','Income ($)']])
sse.append(km.inertia_)

sse

[5.434011511988178,
2.091136388699078,
0.4750783498553096,
0.3491047094419566,
0.26640301246684156,
0.21066678488010523,
0.17681044133887713,
0.13265419827245162,
0.10497488680620906]

plt.xlabel('k')
plt.ylabel('Sum of Squared Error')
plt.plot(k_rng,sse)
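The elbow can also be picked programmatically. One simple heuristic (an illustration added here, not part of the original notebook, and with an arbitrary 30% threshold) chooses the smallest k after which adding another cluster no longer cuts the SSE sharply, using the SSE values printed above:

```python
# SSE values copied from the elbow-method loop above (k = 1..9).
sse = [5.434011511988178, 2.091136388699078, 0.4750783498553096,
       0.3491047094419566, 0.26640301246684156, 0.21066678488010523,
       0.17681044133887713, 0.13265419827245162, 0.10497488680620906]

# drops[i] is the fractional SSE reduction when moving from k = i+1 to k = i+2.
drops = [(sse[i] - sse[i + 1]) / sse[i] for i in range(len(sse) - 1)]

# Elbow: smallest k after which adding a cluster cuts SSE by less than 30%.
elbow_k = next(i + 1 for i, d in enumerate(drops) if d < 0.30)
print(elbow_k)  # → 3, matching the visual elbow
```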


[<matplotlib.lines.Line2D at 0x7f899ed77890>]

Post Lab Questions and Conclusion

Question] State applications of k-means clustering algorithm.
Answer] The k-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation and image compression. The usual goal of a cluster analysis is to get a meaningful intuition of the structure of the data.
Clustering is also used in applications such as market research and customer segmentation, biological data and medical imaging, search-result clustering, recommendation engines, pattern recognition, social network analysis and image processing.
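One of the listed applications, image compression (colour quantization), can be sketched by clustering pixel colours with k-means and replacing each pixel by its cluster centroid. The synthetic pixels below are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "image": 100 RGB pixels drawn around two dominant colours.
rng = np.random.default_rng(0)
pixels = np.vstack([
    rng.normal([200, 30, 30], 5, size=(50, 3)),   # reddish pixels
    rng.normal([30, 30, 200], 5, size=(50, 3)),   # bluish pixels
])

# Compress to a 2-colour palette: every pixel is replaced by its centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_
compressed = palette[km.labels_]

print(compressed.shape)  # same shape as the input, but only 2 distinct colours
```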

Conclusion: The k-means algorithm was performed on a dataset having Income and Age as input attributes. Firstly, both input attributes were scaled to the range 0 to 1; after that, the value of k was chosen using the elbow-curve method. It came out to be 3 and was verified at the end; visually, too, the plot suggests that 3 clusters would be formed. Finally, the clustering was refined twice to obtain the most accurate clustering of the dataset.
