K-means Clustering
Strengths
- Simple iterative method
- User provides "K"
Weaknesses
- Often too simple → bad results
- Difficult to guess the correct "K"
K-means Clustering
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step-4; else go to FINISH.
Step-7: The model is ready.
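These steps map directly onto code. A minimal NumPy sketch of the loop (the function name kmeans, the seed, and the stopping check are my own choices, not from the slides; it assumes no cluster goes empty, which holds for the small example below):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3 / Step-5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: move each centroid to the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step-6: stop when no centroid moves (i.e. no reassignment happened)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# usage on the eight points of the worked example further below:
# X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
# centroids, labels = kmeans(X, k=3)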
K-means Clustering
- Iterate:
  - Calculate distances from objects to cluster centroids.
  - Assign objects to the closest cluster.
  - Recalculate new centroids.
- Stop based on convergence criteria:
  - No change in clusters
  - Max iterations reached
K-means Issues
K-means Clustering
[Figure: scatter plot of unlabeled points. K = ? With two visible groups, now it's K = 2.]
Worked example
Initial centroids: A1 (2, 10), B1 (5, 8), C1 (1, 2)

Data point    x    y
A1            2    10
A2            2    5
A3            8    4
B1            5    8
B2            7    5
B3            6    4
C1            1    2
C2            4    9
Distances to the initial centroids and first cluster assignment:

Data point    x    y     d to (2, 10)    d to (5, 8)    d to (1, 2)    New cluster
A1            2    10    0.00            3.61           8.06           1
A2            2    5     5.00            4.24           3.16           3
A3            8    4     8.49            5.00           7.28           2
B1            5    8     3.61            0.00           7.21           2
B2            7    5     7.07            3.61           6.71           2
B3            6    4     7.21            4.12           5.39           2
C1            1    2     8.06            7.21           0.00           3
C2            4    9     2.24            1.41           7.62           2
Current centroids (means of the iteration-1 clusters): A1 (2, 10), B1 (6, 6), C1 (1.5, 3.5)

Data point    x    y     d to (2, 10)    d to (6, 6)    d to (1.5, 3.5)    New cluster
A1            2    10    0.00            5.66           6.52               1
A2            2    5     5.00            4.12           1.58               3
A3            8    4     8.49            2.83           6.52               2
B1            5    8     3.61            2.24           5.70               2
B2            7    5     7.07            1.41           5.70               2
B3            6    4     7.21            2.00           4.53               2
C1            1    2     8.06            6.40           1.58               3
C2            4    9     2.24            3.61           6.04               1
Current centroids (means of the iteration-2 clusters): A1 (3, 9.5), B1 (6.5, 5.25), C1 (1.5, 3.5)

Data point    x    y     d to (3, 9.5)    d to (6.5, 5.25)    d to (1.5, 3.5)    New cluster
A1            2    10    1.12             6.54                6.52               1
A2            2    5     4.61             4.51                1.58               3
A3            8    4     7.43             1.95                6.52               2
B1            5    8     2.50             3.13                5.70               2
B2            7    5     6.02             0.56                5.70               2
B3            6    4     6.26             1.35                4.53               2
C1            1    2     7.76             6.39                1.58               3
C2            4    9     1.12             4.51                6.04               1

No point changes cluster compared with the previous iteration, so the algorithm has converged with final clusters {A1, C2}, {A3, B1, B2, B3}, and {A2, C1}.
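The distance columns above are plain Euclidean distances, so each table can be checked mechanically. A small sketch reproducing the first table (point names and coordinates taken from the example):

import numpy as np

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centroids = np.array([(2, 10), (5, 8), (1, 2)])  # initial centroids

for name, p in points.items():
    # Euclidean distance from this point to each of the three centroids
    d = np.linalg.norm(np.array(p) - centroids, axis=1)
    print(name, np.round(d, 2), "-> cluster", d.argmin() + 1)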
K = ?
Here different people will say K = 2, 4, or 6.
Elbow Technique
[Figure: elbow plot of SSE versus K for K = 1 to 10; the bend (the "elbow") marks a good choice of K.]
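Here SSE is the within-cluster sum of squared errors (the quantity scikit-learn exposes as inertia_, used in the elbow code below). With $\mu_i$ the centroid of cluster $C_i$:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

Since SSE can only decrease as K grows, the elbow is the point where adding one more cluster stops buying much.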
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
import pandas as pd
%matplotlib inline

df = pd.read_csv("income.csv")
df.head()

   Name     Age  Income($)
0  Rob      27   70000
1  Michael  29   90000
2  Mohan    29   61000
3  Ismail   28   60000
4  Kory     42   150000

plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

[Figure: scatter plot of Income($) (about 40,000-160,000) versus Age (about 27-43).]
Preprocessing using MinMaxScaler

scaler = MinMaxScaler()
df['Age'] = scaler.fit_transform(df[['Age']])
df['Income($)'] = scaler.fit_transform(df[['Income($)']])
df.head()   # the cluster column is left over from an earlier run

   Name     Age       Income($)  cluster
0  Rob      0.058824  0.213675   0
1  Michael  0.176471  0.384615   2
2  Mohan    0.176471  0.136752   0
3  Ismail   0.117647  0.128205   0
4  Kory     0.941176  0.897436   1

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age', 'Income($)']])
y_predicted

array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2])
df['cluster'] = y_predicted
df.head()
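To see the three clusters on the scaled data, a plotting sketch continuing the notebook above, in the same style as the earlier scatter calls (the colour choices are arbitrary):

for i, color in zip(range(3), ['green', 'red', 'black']):
    d = df[df.cluster == i]          # rows assigned to cluster i
    plt.scatter(d.Age, d['Income($)'], color=color)
plt.xlabel('Age')
plt.ylabel('Income($)')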
Elbow Plot

sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age', 'Income($)']])
    sse.append(km.inertia_)

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng, sse)

[Figure: plot of sum of squared error versus K; the elbow matches the n_clusters=3 used above.]
Random Forest
[Figure: the entire training data set (rows 1-6) is sampled with replacement (bootstrapping) to build a different training set for each tree: tree n° 1, tree n° 2, ..., up to tree n° N; tree n° 2's sample contains row 2 twice.]

- The size of the data used to train each individual tree does not have to be the size of the whole data set.
- A data point can be present more than once in the data used to train a single tree (as in tree n° 2). A sampling sketch follows below.
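A minimal sketch of sampling with replacement, assuming a toy training set of six rows (the sample size and seed are arbitrary choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1, 7)      # the entire training set: rows 1..6
n_trees = 3

for t in range(1, n_trees + 1):
    # sampling with replacement: the same row may appear more than once
    sample = rng.choice(data, size=4, replace=True)
    print(f"data used to train tree n°{t}:", sample)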
Theme
If we train a forest with a lot of trees, each trained on different data, we solve this problem (a single tree overfitting its training data).
Random features
[Figure: each tree also considers a random subset of the features; e.g. tree n° 1 picked features E, A, and F at its root node.]
For classification problems, the final prediction is the most frequent prediction made by the trees of the forest.

Classification problem: medical diagnosis
New observation:
- Tree n° 1 → Healthy
- Tree n° 2 → Sick
- ...
- Healthy: 355, Sick: 45
- Prediction: Healthy (the most frequent value)
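The majority vote is easy to express in code; a sketch with hypothetical per-tree predictions matching the counts above:

from collections import Counter

tree_predictions = ["Healthy"] * 355 + ["Sick"] * 45   # one label per tree
prediction, votes = Counter(tree_predictions).most_common(1)[0]
print(prediction, votes)   # Healthy 355 -> the most frequent value wins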
For regression problems, the aggregate decision is the average of the decisions of every single decision tree.

Regression problem: house price estimation
New observation:
- Tree n° 1 → 350.000$
- Tree n° 2 → 275.550$
- ...
- Tree n° N → 312.300$
- Prediction: 322.750$ (the average over all N trees)
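The averaging step in code; a sketch using only the three example estimates shown (the slide's 322.750$ averages all N trees, not just these three):

# hypothetical estimates, one per tree; a real forest has N of them
estimates = [350_000, 275_550, 312_300]

prediction = sum(estimates) / len(estimates)   # average of the listed trees only
print(f"Prediction: {prediction:,.0f}$")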
Colab-code