
14-10-2023

K-means Clustering
Strengths
- Simple iterative method
- User provides "K"
Weaknesses
- Often too simple → bad results
- Difficult to guess the correct "K"

K-means Clustering
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.
Step-7: The model is ready.
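To make these steps concrete, here is a minimal NumPy sketch of the same loop (an illustration of mine, not the slides' code; the names X, k, and max_iters are my own):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3/5: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its cluster
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels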


K-means Clustering
Iterate:
- Calculate distance from objects to cluster centroids
- Assign objects to the closest cluster
- Recalculate new centroids
Stop based on convergence criteria:
- No change in clusters
- Max iterations

K-means Issues
- Distance measure is squared Euclidean
  - Scale should be similar in all dimensions
  - Rescale data?
  - Not good for nominal data. Why?
- Approach tries to minimize the within-cluster sum of squares error (WCSS)
  - Implicit assumption that SSE is similar for each group
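Because squared Euclidean distance is dominated by whichever feature has the largest numeric range, features are usually rescaled before clustering. A small sketch of mine (the five Age/Income rows are taken from the example later in these notes):

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[27, 70000], [29, 90000], [29, 61000], [28, 60000], [42, 150000]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)   # every column now lies in [0, 1]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)   # inertia_ is the WCSS that K-means minimizes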

K-means Clustering
[Figure: a scatter of unlabeled points (K = ?); the same points grouped into two clusters (now K = 2)]

1. Start with K centroids by placing them at random locations. Here K = 2.
2. Compute the distance of every point from the centroids and cluster the points accordingly.
3. Adjust the centroids so that they become the center of gravity of their cluster.
4. Again re-cluster every point based on its distance to the centroids.
5. Again adjust the centroids.

Worked example (K = 3)

Data points: A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9)
Distance measure: d(p1, p2) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Iteration 1 — initial centroids: A1 (2, 10), B1 (5, 8), C1 (1, 2)

Point   (x, y)    d to (2,10)   d to (5,8)   d to (1,2)   Cluster
A1      (2, 10)   0.00          3.61         8.06         1
A2      (2, 5)    5.00          4.24         3.16         3
A3      (8, 4)    8.49          5.00         7.28         2
B1      (5, 8)    3.61          0.00         7.21         2
B2      (7, 5)    7.07          3.61         6.71         2
B3      (6, 4)    7.21          4.12         5.39         2
C1      (1, 2)    8.06          7.21         0.00         3
C2      (4, 9)    2.24          1.41         7.62         2

New centroids:
Cluster 1 (A1): (2, 10)
Cluster 2 (A3, B1, B2, B3, C2): ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
Cluster 3 (A2, C1): ((2+1)/2, (5+2)/2) = (1.5, 3.5)

Iteration 2 — current centroids: A1 (2, 10), B1 (6, 6), C1 (1.5, 3.5)

Point   (x, y)    d to (2,10)   d to (6,6)   d to (1.5,3.5)   Cluster
A1      (2, 10)   0.00          5.66         6.52             1
A2      (2, 5)    5.00          4.12         1.58             3
A3      (8, 4)    8.49          2.83         6.52             2
B1      (5, 8)    3.61          2.24         5.70             2
B2      (7, 5)    7.07          1.41         5.70             2
B3      (6, 4)    7.21          2.00         4.53             2
C1      (1, 2)    8.06          6.40         1.58             3
C2      (4, 9)    2.24          3.61         6.04             1

New centroids:
Cluster 1 (A1, C2): (3, 9.5)
Cluster 2 (A3, B1, B2, B3): (6.5, 5.25)
Cluster 3 (A2, C1): (1.5, 3.5)

Iteration 3 — current centroids: A1 (3, 9.5), B1 (6.5, 5.25), C1 (1.5, 3.5)

Point   (x, y)    d to (3,9.5)   d to (6.5,5.25)   d to (1.5,3.5)   Cluster
A1      (2, 10)   1.12           6.54              6.52             1
A2      (2, 5)    4.61           4.51              1.58             3
A3      (8, 4)    7.43           1.95              6.52             2
B1      (5, 8)    2.50           3.13              5.70             1
B2      (7, 5)    6.02           0.56              5.70             2
B3      (6, 4)    6.26           1.35              4.53             2
C1      (1, 2)    7.76           6.39              1.58             3
C2      (4, 9)    1.12           4.51              6.04             1

Iteration 4 — current centroids: A1 (3.67, 9), B1 (7, 4.33), C1 (1.5, 3.5)

Point   (x, y)    d to (3.67,9)   d to (7,4.33)   d to (1.5,3.5)   Cluster
A1      (2, 10)   1.94            7.56            6.52             1
A2      (2, 5)    4.33            5.04            1.58             3
A3      (8, 4)    6.62            1.05            6.52             2
B1      (5, 8)    1.67            4.18            5.70             1
B2      (7, 5)    5.21            0.67            5.70             2
B3      (6, 4)    5.52            1.05            4.53             2
C1      (1, 2)    7.49            6.44            1.58             3
C2      (4, 9)    0.33            5.55            6.04             1

Stop: the cluster assignments are the same as in the previous step.
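This hand-worked result can be checked in code by seeding scikit-learn's KMeans with the same initial centroids (a sketch of mine, not part of the original slides):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
init = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)   # A1, B1, C1

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.labels_)            # final cluster of each point
print(km.cluster_centers_)   # should end near (3.67, 9), (7, 4.33), (1.5, 3.5)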


How to Determine K?
K = number of clusters

K = ?
For the same data, different people may say K = 2, 4, or 6.


Sum of Squared Error (SSE)

For cluster i with centroid c_i:
SSE_i = sum over all points x in cluster i of dist(x, c_i)^2

Total error: SSE = SSE_1 + SSE_2 + ... + SSE_K

Plot the SSE for k = 1, 2, 3, 4, 5, ... (up to all points).

Observations:
- As the number of clusters increases, SSE goes down.
- When every point becomes an individual cluster, the error reaches zero.

Elbow Technique
[Plot: SSE versus k for k = 1 to 11; the curve drops steeply and then flattens, with the "elbow" at k = 4]

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_csv("income.csv")
df.head()

   Name     Age  Income($)
0  Rob      27   70000
1  Michael  29   90000
2  Mohan    29   61000
3  Ismail   28   60000
4  Kory     42   150000

plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

[Scatter plot: Income($) versus Age for the raw data]

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
y_predicted
array([2, 2, 0, 0, 1, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0])

df['cluster'] = y_predicted
df.head()

   Name     Age  Income($)  cluster
0  Rob      27   70000      2
1  Michael  29   90000      2
2  Mohan    29   61000      0
3  Ismail   28   60000      0
4  Kory     42   150000     1
km.cluster_centers_
array([[3.29090909e+01, 5.61363636e+04],
       [3.82857143e+01, 1.50000000e+05],
       [3.40000000e+01, 8.05000000e+04]])

df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]
plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.legend()

[Scatter plot: Income($) versus Age, the three clusters in green/red/black with purple centroid markers]

Preprocessing using MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
df.head()

   Name     Age       Income($)  cluster
0  Rob      0.058824  0.213675   2
1  Michael  0.176471  0.384615   2
2  Mohan    0.176471  0.136752   0
3  Ismail   0.117647  0.128205   0
4  Kory     0.941176  0.897436   1

plt.scatter(df.Age, df['Income($)'])
[Scatter plot: Income($) versus Age after min-max scaling, both axes in the range 0 to 1]

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
y_predicted
array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2])


df['cluster'] = y_predicted
df.head()

   Name     Age       Income($)  cluster
0  Rob      0.058824  0.213675   0
1  Michael  0.176471  0.384615   0
2  Mohan    0.176471  0.136752   0
3  Ismail   0.117647  0.128205   0
4  Kory     0.941176  0.897436   1

km.cluster_centers_
array([[0.1372549 , 0.11633428],
       [0.72268908, 0.8974359 ],
       [0.85294116, 0.2022792 ]])

df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]
plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.legend()

[Scatter plot: the three clusters on the scaled data, with purple centroid markers]

Elbow Plot

sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age', 'Income($)']])
    sse.append(km.inertia_)

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng, sse)

[Line plot: sum of squared error versus K; the elbow indicates a suitable number of clusters]


Random Forest Algorithm

- Prerequisite: know decision trees.

- Random Forest models are a kind of non-parametric model that can be used for both
  - regression
  - classification.

- One of the main drawbacks of Decision Trees is that they are very prone to over-fitting: they do well on training data, but are not so flexible for making predictions on unseen samples.

But why the name "Random"? Where is the randomness?
Let's find out by learning how a Random Forest model is built.

Training and Building a Random Forest
Step-1: Create a bootstrapped data set for each tree.
Step-2: Train a forest of trees using these random data sets, and add a little more randomness with the feature selection.
Step-3: Repeat this for the N trees to create our awesome forest.
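In practice these three steps are wrapped up by library implementations. A minimal, hedged sketch using scikit-learn (the dataset and parameter values are illustrative choices of mine, not from the slides):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = N trees; each tree sees a bootstrap sample (bootstrap=True)
# and only a random subset of features at every split (max_features)
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))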


Step-1: Creating a bootstrapped data set for each tree

- To build a Random Forest we have to train N decision trees.
- Do we use the same data set for every tree? No.
- To train each individual tree, we pick a random sample of the entire data set, as shown in the following figure.

[Figure: the entire training data set is sampled with replacement (bootstrapping) to create a different data set for each tree; some rows appear more than once and some not at all]

- The size of the data used to train each individual tree does not have to be the size of the whole data set.
- A data point can be present more than once in the data used to train a single tree (like in tree no. 2).
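A bootstrap sample is simply sampling row indices with replacement. A tiny NumPy illustration of mine (not the slides' code):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1, 7)   # the six rows of the training set, labeled 1..6

# one bootstrap sample per tree, drawn with replacement
for tree in range(3):
    sample = rng.choice(data, size=len(data), replace=True)
    print(f"tree {tree + 1} trains on rows {sample}")
    # some rows repeat, others are left out ("out-of-bag" rows)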

Theme
If we train a forest with a lot of trees and each of them has been trained on different data, we solve this over-fitting problem.

Step-2: Train a forest of trees using these random data sets, and add a little more randomness with the feature selection

When building an ordinary decision tree, at each node we evaluate a certain metric (like the Gini index or Information Gain) and pick the feature or variable of the data that minimises/maximises this metric to go in that node.

[Figure: building tree no. 1 from its bootstrapped data. The full feature set is A, B, C, D, E, F, G, but at each node only a few randomly picked features compete for the split, e.g. E, A, F at the root node (F is chosen), with different random subsets drawn at node 1 and node 2]
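The per-node feature subsampling can be sketched as follows (an illustrative snippet of mine; a real implementation would also search for the actual split threshold, and the scores here are made up):

import numpy as np

rng = np.random.default_rng(0)
features = np.array(list("ABCDEFG"))

def best_feature_at_node(scores_by_feature, n_candidates=3):
    # pick a random subset of features, then keep the one with the best score
    candidates = rng.choice(features, size=n_candidates, replace=False)
    return max(candidates, key=lambda f: scores_by_feature[f])

# hypothetical, made-up split scores just to exercise the function
scores = dict(zip(features, [0.1, 0.3, 0.2, 0.05, 0.4, 0.6, 0.25]))
print(best_feature_at_node(scores))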

Step-3: Repeat this for the N trees to create our awesome forest

Repeat this for the N trees, randomly selecting at each node of each tree which variables enter the contest for being picked as the feature to split on.

Making predictions using a Random Forest

- Making predictions with a Random Forest is very easy: we take each of our individual trees, pass the observation for which we want a prediction through them, get a prediction from every tree (N predictions in total), and then obtain an overall, aggregated prediction.
- For regression problems, the aggregate decision is the average of the decisions of every single decision tree.
- For classification problems, the final prediction is the most frequent prediction made by the forest.
Classification problem: medical diagnosis
A new observation is passed through every tree:
- Tree no. 1: Healthy
- Tree no. 2: Sick
- ...
- Tree no. N-1: Healthy
- Tree no. N: Healthy
Totals: Healthy 355, Sick 45 → Prediction: Healthy (the most frequent value).
Regression problem: house price estimation
The new observation is passed through every tree:
- Tree no. 1: 350,000$
- Tree no. 2: 275,550$
- ...
- Tree no. N-1: 392,210$
- Tree no. N: 312,300$
Prediction: 322,750$ (the average numerical prediction).
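The aggregation itself is a one-liner in either case. A small sketch of mine, assuming the per-tree predictions have already been collected into a list:

from statistics import mean, mode

# classification: majority vote over the N tree predictions
tree_votes = ["Healthy", "Sick", "Healthy", "Healthy"]
print(mode(tree_votes))        # -> "Healthy"

# regression: average of the N tree predictions
tree_estimates = [350_000, 275_550, 392_210, 312_300]
print(mean(tree_estimates))    # -> 332515 for just these four illustrative trees

(The slide's 322,750$ prediction is the average over all N trees, not only the four shown above.)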

" Local drive

Colab-code
