
14-10-2023

K-means Clustering
Strengths
- Simple iterative method
- User provides "K"
Weaknesses
- Often too simple → bad results
- Difficult to guess the correct "K"

K-means Clustering
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.
Step-7: The model is ready.
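To make these steps concrete, here is a minimal NumPy sketch of the same loop (an illustration of mine, not the slides' code; the names X, k, and max_iters are my own):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3/5: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its cluster
        # (empty clusters are not handled here, for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-6: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels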


K-means Clustering
Iterate:
- Calculate distance from objects to cluster centroids
- Assign objects to the closest cluster
- Recalculate new centroids
Stop based on convergence criteria:
- No change in clusters
- Max iterations

K-means Issues
- Distance measure is squared Euclidean
  - Scale should be similar in all dimensions
  - Rescale data?
  - Not good for nominal data. Why?
- Approach tries to minimize the within-cluster sum of squares error (WCSS)
  - Implicit assumption that SSE is similar for each group
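Because squared Euclidean distance is dominated by whichever feature has the largest numeric range, features are usually rescaled before clustering. A small sketch of mine (the five Age/Income rows are taken from the example later in these notes):

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[27, 70000], [29, 90000], [29, 61000], [28, 60000], [42, 150000]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)   # every column now lies in [0, 1]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)   # inertia_ is the WCSS that K-means minimizes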

K-means Clustering
[Figure: a scatter of unlabeled points (K = ?); the same points grouped into two clusters (now K = 2)]

1. Start with K centroids by placing them at random locations. Here K = 2.
2. Compute the distance of every point from the centroids and cluster the points accordingly.
3. Adjust the centroids so that they become the center of gravity of their cluster.
4. Again re-cluster every point based on its distance to the centroids.
5. Again adjust the centroids.

Worked example (K = 3)

Data points: A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9)
Distance measure: d(p1, p2) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Iteration 1 — initial centroids: A1 (2, 10), B1 (5, 8), C1 (1, 2)

Point   (x, y)    d to (2,10)   d to (5,8)   d to (1,2)   Cluster
A1      (2, 10)   0.00          3.61         8.06         1
A2      (2, 5)    5.00          4.24         3.16         3
A3      (8, 4)    8.49          5.00         7.28         2
B1      (5, 8)    3.61          0.00         7.21         2
B2      (7, 5)    7.07          3.61         6.71         2
B3      (6, 4)    7.21          4.12         5.39         2
C1      (1, 2)    8.06          7.21         0.00         3
C2      (4, 9)    2.24          1.41         7.62         2

New centroids:
Cluster 1 (A1): (2, 10)
Cluster 2 (A3, B1, B2, B3, C2): ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6)
Cluster 3 (A2, C1): ((2+1)/2, (5+2)/2) = (1.5, 3.5)

Iteration 2 — current centroids: A1 (2, 10), B1 (6, 6), C1 (1.5, 3.5)

Point   (x, y)    d to (2,10)   d to (6,6)   d to (1.5,3.5)   Cluster
A1      (2, 10)   0.00          5.66         6.52             1
A2      (2, 5)    5.00          4.12         1.58             3
A3      (8, 4)    8.49          2.83         6.52             2
B1      (5, 8)    3.61          2.24         5.70             2
B2      (7, 5)    7.07          1.41         5.70             2
B3      (6, 4)    7.21          2.00         4.53             2
C1      (1, 2)    8.06          6.40         1.58             3
C2      (4, 9)    2.24          3.61         6.04             1

New centroids:
Cluster 1 (A1, C2): (3, 9.5)
Cluster 2 (A3, B1, B2, B3): (6.5, 5.25)
Cluster 3 (A2, C1): (1.5, 3.5)

Iteration 3 — current centroids: A1 (3, 9.5), B1 (6.5, 5.25), C1 (1.5, 3.5)

Point   (x, y)    d to (3,9.5)   d to (6.5,5.25)   d to (1.5,3.5)   Cluster
A1      (2, 10)   1.12           6.54              6.52             1
A2      (2, 5)    4.61           4.51              1.58             3
A3      (8, 4)    7.43           1.95              6.52             2
B1      (5, 8)    2.50           3.13              5.70             1
B2      (7, 5)    6.02           0.56              5.70             2
B3      (6, 4)    6.26           1.35              4.53             2
C1      (1, 2)    7.76           6.39              1.58             3
C2      (4, 9)    1.12           4.51              6.04             1

Iteration 4 — current centroids: A1 (3.67, 9), B1 (7, 4.33), C1 (1.5, 3.5)

Point   (x, y)    d to (3.67,9)   d to (7,4.33)   d to (1.5,3.5)   Cluster
A1      (2, 10)   1.94            7.56            6.52             1
A2      (2, 5)    4.33            5.04            1.58             3
A3      (8, 4)    6.62            1.05            6.52             2
B1      (5, 8)    1.67            4.18            5.70             1
B2      (7, 5)    5.21            0.67            5.70             2
B3      (6, 4)    5.52            1.05            4.53             2
C1      (1, 2)    7.49            6.44            1.58             3
C2      (4, 9)    0.33            5.55            6.04             1

Stop: the cluster assignments are the same as in the previous step.
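This hand-worked result can be checked in code by seeding scikit-learn's KMeans with the same initial centroids (a sketch of mine, not part of the original slides):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
init = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)   # A1, B1, C1

km = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
print(km.labels_)            # final cluster of each point
print(km.cluster_centers_)   # should end near (3.67, 9), (7, 4.33), (1.5, 3.5)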


How to Determine K?
K = number of clusters

K = ?
For the same data, different people may say K = 2, 4, or 6.


Sum of Squared Error (SSE)

For cluster i with centroid c_i:
SSE_i = sum over all points x in cluster i of dist(x, c_i)^2

Total error: SSE = SSE_1 + SSE_2 + ... + SSE_K

Plot the SSE for k = 1, 2, 3, 4, 5, ... (up to all points).

Observations:
- As the number of clusters increases, SSE goes down.
- When every point becomes an individual cluster, the error reaches zero.

Elbow Technique
[Plot: SSE versus k for k = 1 to 11; the curve drops steeply and then flattens, with the "elbow" at k = 4]

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
%matplotlib inline

df = pd.read_csv("income.csv")
df.head()

   Name     Age  Income($)
0  Rob      27   70000
1  Michael  29   90000
2  Mohan    29   61000
3  Ismail   28   60000
4  Kory     42   150000

plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

[Scatter plot: Income($) versus Age for the raw data]

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
y_predicted
array([2, 2, 0, 0, 1, 1, 2, 1, 2, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0])

df['cluster'] = y_predicted
df.head()

   Name     Age  Income($)  cluster
0  Rob      27   70000      2
1  Michael  29   90000      2
2  Mohan    29   61000      0
3  Ismail   28   60000      0
4  Kory     42   150000     1
km.cluster_centers_
array([[3.29090909e+01, 5.61363636e+04],
       [3.82857143e+01, 1.50000000e+05],
       [3.40000000e+01, 8.05000000e+04]])

df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]
plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.legend()

[Scatter plot: Income($) versus Age, the three clusters in green/red/black with purple centroid markers]

Preprocessing using MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
df.head()

   Name     Age       Income($)  cluster
0  Rob      0.058824  0.213675   2
1  Michael  0.176471  0.384615   2
2  Mohan    0.176471  0.136752   0
3  Ismail   0.117647  0.128205   0
4  Kory     0.941176  0.897436   1

plt.scatter(df.Age, df['Income($)'])
[Scatter plot: Income($) versus Age after min-max scaling, both axes in the range 0 to 1]

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
y_predicted
array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2])


df['cluster'] = y_predicted
df.head()

   Name     Age       Income($)  cluster
0  Rob      0.058824  0.213675   0
1  Michael  0.176471  0.384615   0
2  Mohan    0.176471  0.136752   0
3  Ismail   0.117647  0.128205   0
4  Kory     0.941176  0.897436   1

km.cluster_centers_
array([[0.1372549 , 0.11633428],
       [0.72268908, 0.8974359 ],
       [0.85294116, 0.2022792 ]])

df1 = df[df.cluster == 0]
df2 = df[df.cluster == 1]
df3 = df[df.cluster == 2]
plt.scatter(df1.Age, df1['Income($)'], color='green')
plt.scatter(df2.Age, df2['Income($)'], color='red')
plt.scatter(df3.Age, df3['Income($)'], color='black')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], color='purple', marker='*', label='centroid')
plt.legend()

[Scatter plot: the three clusters on the scaled data, with purple centroid markers]

Elbow Plot

sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age', 'Income($)']])
    sse.append(km.inertia_)

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng, sse)

[Line plot: sum of squared error versus K; the elbow indicates a suitable number of clusters]


Random Forest Algorithm

- Prerequisite: know decision trees.

- Random Forest models are a kind of non-parametric model that can be used for both
  - regression
  - classification.

- One of the main drawbacks of Decision Trees is that they are very prone to over-fitting: they do well on training data, but are not so flexible for making predictions on unseen samples.

But why the name "Random"? Where is the randomness?
Let's find out by learning how a Random Forest model is built.

Training and Building a Random Forest
Step-1: Create a bootstrapped data set for each tree.
Step-2: Train a forest of trees using these random data sets, and add a little more randomness with the feature selection.
Step-3: Repeat this for the N trees to create our awesome forest.
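In practice these three steps are wrapped up by library implementations. A minimal, hedged sketch using scikit-learn (the dataset and parameter values are illustrative choices of mine, not from the slides):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = N trees; each tree sees a bootstrap sample (bootstrap=True)
# and only a random subset of features at every split (max_features)
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))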


Step-1: Creating a bootstrapped data set for each tree

- To build a Random Forest we have to train N decision trees.
- Do we use the same data set for every tree? No.
- To train each individual tree, we pick a random sample of the entire data set, as shown in the following figure.

[Figure: the entire training data set is sampled with replacement (bootstrapping) to create a different data set for each tree; some rows appear more than once and some not at all]

- The size of the data used to train each individual tree does not have to be the size of the whole data set.
- A data point can be present more than once in the data used to train a single tree (like in tree no. 2).
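A bootstrap sample is simply sampling row indices with replacement. A tiny NumPy illustration of mine (not the slides' code):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1, 7)   # the six rows of the training set, labeled 1..6

# one bootstrap sample per tree, drawn with replacement
for tree in range(3):
    sample = rng.choice(data, size=len(data), replace=True)
    print(f"tree {tree + 1} trains on rows {sample}")
    # some rows repeat, others are left out ("out-of-bag" rows)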

Theme
If we train a forest with a lot of trees and each of them has been trained on different data, we solve this over-fitting problem.

Step-2: Train a forest of trees using these random data sets, and add a little more randomness with the feature selection

When building an ordinary decision tree, at each node we evaluate a certain metric (like the Gini index or Information Gain) and pick the feature or variable of the data that minimises/maximises this metric to go in that node.

[Figure: building tree no. 1 from its bootstrapped data. The full feature set is A, B, C, D, E, F, G, but at each node only a few randomly picked features compete for the split, e.g. E, A, F at the root node (F is chosen), with different random subsets drawn at node 1 and node 2]
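The per-node feature subsampling can be sketched as follows (an illustrative snippet of mine; a real implementation would also search for the actual split threshold, and the scores here are made up):

import numpy as np

rng = np.random.default_rng(0)
features = np.array(list("ABCDEFG"))

def best_feature_at_node(scores_by_feature, n_candidates=3):
    # pick a random subset of features, then keep the one with the best score
    candidates = rng.choice(features, size=n_candidates, replace=False)
    return max(candidates, key=lambda f: scores_by_feature[f])

# hypothetical, made-up split scores just to exercise the function
scores = dict(zip(features, [0.1, 0.3, 0.2, 0.05, 0.4, 0.6, 0.25]))
print(best_feature_at_node(scores))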

Step-3: Repeat this for the N trees to create our awesome forest

Repeat this for the N trees, randomly selecting at each node of each tree which variables enter the contest for being picked as the feature to split on.

Making predictions using a Random Forest

- Making predictions with a Random Forest is very easy: we take each of our individual trees, pass the observation for which we want a prediction through them, get a prediction from every tree (N predictions in total), and then obtain an overall, aggregated prediction.
- For regression problems, the aggregate decision is the average of the decisions of every single decision tree.
- For classification problems, the final prediction is the most frequent prediction made by the forest.
Classification problem: medical diagnosis
A new observation is passed through every tree:
- Tree no. 1: Healthy
- Tree no. 2: Sick
- ...
- Tree no. N-1: Healthy
- Tree no. N: Healthy
Totals: Healthy 355, Sick 45 → Prediction: Healthy (the most frequent value).
Regression problem: house price estimation
The new observation is passed through every tree:
- Tree no. 1: 350,000$
- Tree no. 2: 275,550$
- ...
- Tree no. N-1: 392,210$
- Tree no. N: 312,300$
Prediction: 322,750$ (the average numerical prediction).
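The aggregation itself is a one-liner in either case. A small sketch of mine, assuming the per-tree predictions have already been collected into a list:

from statistics import mean, mode

# classification: majority vote over the N tree predictions
tree_votes = ["Healthy", "Sick", "Healthy", "Healthy"]
print(mode(tree_votes))        # -> "Healthy"

# regression: average of the N tree predictions
tree_estimates = [350_000, 275_550, 392_210, 312_300]
print(mean(tree_estimates))    # -> 332515 for just these four illustrative trees

(The slide's 322,750$ prediction is the average over all N trees, not only the four shown above.)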

" Local drive

Colab-code
