
B.Tech (R&A)
Semester: 7 Subject: Machine Learning
Name: Pratik Mane Class: R&A Final Year
Roll No: PA66 Batch:

Experiment No: 06

Name of the Experiment: K Means Clustering

Performed on: Marks Teacher’s Signature with date

Submitted on:

Aim: To form clusters of given data using k-means clustering

Pre-requisite: concept of similarity measure, unsupervised learning

Objective:
1. To solve clustering example by using k-means algorithm
2. To implement the same using python

Components and equipment required:


PC with Windows 7 or later and the Anaconda distribution with Python 3.7

www.mitwpu.edu.in
Expt. 6
(The following part is to be solved in the Colab notebook)

Conclusion:

Post-lab questions:
1. State applications of k-means clustering algorithm.

Additional Links for reference:

https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf

11/6/22, 1:40 PM PA66 EXP6 ML.ipynb - Colaboratory

Name: Pratik Mane


Roll Number: PA-66
Batch: B-23
PRN Number: 1032201624
Machine Learning Experiment -6: K-Means Clustering Algorithm

from sklearn.cluster import KMeans
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

# Import the dataset and add column names to the imported dataset


df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Experiment 6 K means/K-Means_data')

df.head()

      Name  Age  Income ($)
0      Rob   27       70000
1  Michael   29       90000
2    Mohan   29       61000
3   Ismail   28       60000
4     Kory   42      150000

plt.scatter(df['Age'], df['Income ($)'])

<matplotlib.collections.PathCollection at 0x7f899eebc110>

# Choosing value of k
km = KMeans(n_clusters=3)
km

https://colab.research.google.com/drive/1holqy_3nnMUE2ghbRw32OHUPZxL4MAsb#scrollTo=lB169pK9cYmI&printMode=true 1/7

KMeans(n_clusters=3)

# The Name column is excluded since it is a string and will not be used in numeric computation
# Fit and predict
y_predicted = km.fit_predict(df[['Age','Income ($)']])
y_predicted

array([1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2],
dtype=int32)
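fit_predict returns one integer cluster label per row. As a small illustrative sketch (reusing the label array shown above, not re-running the model), NumPy's bincount counts how many points land in each cluster:

```python
import numpy as np

# Labels as returned by fit_predict: one integer cluster id per row.
labels = np.array([1, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0,
                   2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2])

# np.bincount gives the number of points assigned to each cluster id.
sizes = np.bincount(labels)
print(sizes)  # member counts for clusters 0, 1, 2
```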

df['Cluster'] = y_predicted
df.head()

      Name  Age  Income ($)  Cluster
0      Rob   27       70000        1
1  Michael   29       90000        1
2    Mohan   29       61000        2
3   Ismail   28       60000        2
4     Kory   42      150000        0

km.cluster_centers_

array([[3.82857143e+01, 1.50000000e+05],
[3.40000000e+01, 8.05000000e+04],
[3.29090909e+01, 5.61363636e+04]])
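At convergence, each entry of cluster_centers_ is simply the mean of the points assigned to that cluster. A minimal sketch on toy 2-D data (purely illustrative, not the experiment's file) verifies this property:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
              [8.0, 8.0], [8.2, 7.8], [7.8, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Each centroid equals the mean of the points assigned to it.
for k in range(2):
    assert np.allclose(km.cluster_centers_[k], X[labels == k].mean(axis=0))
```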

# Divide the dataset into three parts according to the 3 clusters


df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]

plt.scatter(df1.Age, df1['Income ($)'], color='green', label='Green Cluster')


plt.scatter(df2.Age, df2['Income ($)'], color='red', label='Red Cluster')
plt.scatter(df3.Age, df3['Income ($)'], color='black', label='Black Cluster')

plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
# From the graph we can see the two features are on very different scales (income in tens of thousands vs age in tens),
# so we need to use Min-Max scaling before applying the k-means algorithm properly.


<matplotlib.legend.Legend at 0x7f899eeafc50>

# Divide the dataset into three parts according to the 3 clusters

df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]

plt.scatter(df1.Age, df1['Income ($)'], color='green', label='Green Cluster')


plt.scatter(df2.Age, df2['Income ($)'], color='red', label='Red Cluster')
plt.scatter(df3.Age, df3['Income ($)'], color='black', label='Black Cluster')

plt.xlabel('Age')
plt.ylabel('Income ($)')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='purple', marker='*', label='Centroid')


plt.legend()
# From the graph we can see the two features are on very different scales (income in tens of thousands vs age in tens),
# so we need to use Min-Max scaling before applying the k-means algorithm properly.

<matplotlib.legend.Legend at 0x7f899f4a7dd0>

# Scale the Age and Income features to the range 0 to 1


from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['Income ($)']])
df[['Income ($)']] = scaler.transform(df[['Income ($)']])

scaler.fit(df[['Age']])
df[['Age']] = scaler.transform(df[['Age']])
df


        Name       Age  Income ($)  Cluster
0        Rob  0.058824    0.213675        1
1    Michael  0.176471    0.384615        1
2      Mohan  0.176471    0.136752        2
3     Ismail  0.117647    0.128205        2
4       Kory  0.941176    0.897436        0
5     Gautam  0.764706    0.940171        0
6      David  0.882353    0.982906        0
7     Andrea  0.705882    1.000000        0
8       Brad  0.588235    0.948718        0
9   Angelina  0.529412    0.726496        0
10    Donald  0.647059    0.786325        0
11       Tom  0.000000    0.000000        2
12    Arnold  0.058824    0.025641        2
13     Jared  0.117647    0.051282        2
14     Stark  0.176471    0.038462        2
15    Ranbir  0.352941    0.068376        2
16     Dipak  0.823529    0.170940        2
17  Priyanka  0.882353    0.153846        2
18      Nick  1.000000    0.162393        2
19      Alia  0.764706    0.299145        1
20       Sid  0.882353    0.316239        1
21     Abdul  0.764706    0.111111        2

# Use the k-means algorithm to train on the scaled Age and Income ($) features
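Min-Max scaling maps each feature to [0, 1] via (x - min) / (max - min). The sketch below (with hypothetical income values, not the experiment's file) checks this formula against scikit-learn's MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical income column for illustration.
income = np.array([[60000.0], [70000.0], [90000.0], [150000.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(income)

# Manual formula: (x - min) / (max - min)
manual = (income - income.min()) / (income.max() - income.min())
assert np.allclose(scaled, manual)
print(scaled.ravel())  # minimum maps to 0.0, maximum to 1.0
```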
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income ($)']])
y_predicted

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
dtype=int32)

df['Cluster'] = y_predicted
# Note: drop() returns a copy; without reassignment or inplace=True the Cluster column remains in df
df.drop('Cluster', axis='columns')
df


        Name       Age  Income ($)  Cluster
0        Rob  0.058824    0.213675        1
1    Michael  0.176471    0.384615        1
2      Mohan  0.176471    0.136752        1
3     Ismail  0.117647    0.128205        1
4       Kory  0.941176    0.897436        2
5     Gautam  0.764706    0.940171        2
6      David  0.882353    0.982906        2
7     Andrea  0.705882    1.000000        2
8       Brad  0.588235    0.948718        2
9   Angelina  0.529412    0.726496        2
10    Donald  0.647059    0.786325        2
11       Tom  0.000000    0.000000        1
12    Arnold  0.058824    0.025641        1
13     Jared  0.117647    0.051282        1
14     Stark  0.176471    0.038462        1
15    Ranbir  0.352941    0.068376        1
16     Dipak  0.823529    0.170940        0
17  Priyanka  0.882353    0.153846        0
18      Nick  1.000000    0.162393        0
19      Alia  0.764706    0.299145        0
20       Sid  0.882353    0.316239        0
21     Abdul  0.764706    0.111111        0

km.cluster_centers_

array([[0.85294118, 0.2022792 ],
       [0.1372549 , 0.11633428],
       [0.72268908, 0.8974359 ]])

df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]

plt.scatter(df1.Age, df1['Income ($)'], color='green', label='Green Cluster')


plt.scatter(df2.Age, df2['Income ($)'], color='red', label='Red Cluster')
plt.scatter(df3.Age, df3['Income ($)'], color='black', label='Black Cluster')

plt.xlabel('Age')
plt.ylabel('Income ($)')

plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1], color='purple', marker='*', label='Centroid')


plt.legend()
# After Min-Max scaling both features lie in [0, 1], so the three clusters are well separated
# and the centroids (stars) now sit at the centre of each cluster.

<matplotlib.legend.Legend at 0x7f899f176c90>

k_rng = range(1,10)
sse = []
for k in k_rng:
km = KMeans(n_clusters=k)
km.fit(df[['Age','Income ($)']])
sse.append(km.inertia_)

sse

[5.434011511988178,
2.091136388699078,
0.4750783498553096,
0.3491047094419566,
0.26640301246684156,
0.21066678488010523,
0.17681044133887713,
0.13265419827245162,
0.10497488680620906]

plt.xlabel('k')
plt.ylabel('Sum of Squared Error')
plt.plot(k_rng,sse)
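The elbow can also be picked programmatically. One simple heuristic (an illustration added here, not part of the original notebook, and with an arbitrary 30% threshold) chooses the smallest k after which adding another cluster no longer cuts the SSE sharply, using the SSE values printed above:

```python
# SSE values copied from the elbow-method loop above (k = 1..9).
sse = [5.434011511988178, 2.091136388699078, 0.4750783498553096,
       0.3491047094419566, 0.26640301246684156, 0.21066678488010523,
       0.17681044133887713, 0.13265419827245162, 0.10497488680620906]

# drops[i] is the fractional SSE reduction when moving from k = i+1 to k = i+2.
drops = [(sse[i] - sse[i + 1]) / sse[i] for i in range(len(sse) - 1)]

# Elbow: smallest k after which adding a cluster cuts SSE by less than 30%.
elbow_k = next(i + 1 for i, d in enumerate(drops) if d < 0.30)
print(elbow_k)  # → 3, matching the visual elbow
```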


[<matplotlib.lines.Line2D at 0x7f899ed77890>]

Post Lab Questions and Conclusion

Question] State applications of k-means clustering algorithm.
Answer] The k-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation and image compression. The usual goal of a cluster analysis is to get a meaningful intuition of the structure of the data.
Clustering is also used in applications such as market research and customer segmentation, biological data and medical imaging, search-result clustering, recommendation engines, pattern recognition, social network analysis and image processing.
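One of the listed applications, image compression (colour quantization), can be sketched by clustering pixel colours with k-means and replacing each pixel by its cluster centroid. The synthetic pixels below are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "image": 100 RGB pixels drawn around two dominant colours.
rng = np.random.default_rng(0)
pixels = np.vstack([
    rng.normal([200, 30, 30], 5, size=(50, 3)),   # reddish pixels
    rng.normal([30, 30, 200], 5, size=(50, 3)),   # bluish pixels
])

# Compress to a 2-colour palette: every pixel is replaced by its centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_
compressed = palette[km.labels_]

print(compressed.shape)  # same shape as the input, but only 2 distinct colours
```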

Conclusion: The k-means algorithm was performed on a dataset having Income and Age as input attributes. Firstly, both input attributes were scaled to the range 0 to 1; after that, the value of k was chosen using the elbow-curve method. It came out to be 3 and was verified at the end; visually, too, the plot suggests that 3 clusters would be formed. Finally, the clustering was refined twice to obtain the most accurate clustering of the dataset.
