Week 9 – Clustering
PROF. DR. GÖKHAN SILAHTAROĞLU
Lecturer: NADA MISK
MODEL VALIDATION
Holdout Method
Model validation means calculating, after the learning process is complete and predictions have been made on the test data, how reliable those predictions are.
Error = 1 − Accuracy
Recall = tp / (tp + fn)
Precision = tp / (tp + fp)
Specificity = tn / (tn + fp)
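These metrics can be computed directly from the four confusion-matrix counts. A minimal sketch (the counts below are made-up illustration values, not from the lecture):

```python
def metrics(tp, fp, tn, fn):
    """Validation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,          # Error = 1 - Accuracy
        "recall": tp / (tp + fn),       # true positive rate
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical counts for illustration
m = metrics(tp=40, fp=10, tn=45, fn=5)
print(m["accuracy"])  # 0.85
```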
EXAMPLE
https://www.youtube.com/watch?v=4jRBRDbJemM
EVALUATION OF PREDICTIVE SUCCESS OF MODELS
VARIANCE – BIAS
To reduce underfitting;
• Increase the model complexity.
• Clean the noise (outliers, meaningless data, anything that confuses the model) from the data.
• Increase the number of epochs.
• Increase the training time.
To reduce overfitting;
• Increase the training data.
• Reduce the model complexity.
• Use early stopping during the training phase.
https://www.youtube.com/watch?v=EuBBz3bI-aA
https://www.youtube.com/watch?v=ms-Ooh9mjiE
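One common way to realize the early-stopping advice is a training loop that halts once validation loss stops improving for a few epochs. The `train_step` callback and the toy loss curve below are invented for illustration:

```python
def early_stop_train(train_step, patience=3, max_epochs=100):
    """Stop training when validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        val_loss = train_step(epoch)  # runs one epoch, returns validation loss
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss plateaued: stop early
    return best_loss, epoch

# Toy loss curve: improves, then starts overfitting (validation loss rises)
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
best, stopped_at = early_stop_train(lambda e: losses[e], patience=3)
print(best, stopped_at)  # 0.5 6
```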
CLUSTERING
Parameters used:
Distance and Similarity Measures.
Cat Dog Example
There is no concept of class!
Areas of use:
Customer segmentation
DNA identification
Fingerprint, face, or voice recognition
Product recommendation
Customer segmentation examples:
How often and how much a customer purchases
Which products were viewed, and how many times
How long the customer spent reviewing a product
CLUSTERING EXAMPLE
CLUSTER ANALYSIS
Clustering is the process of dividing data into clusters according to their similarities: similarity within a cluster is high, and dissimilarity between clusters is high.
AREAS WHERE CLUSTERING IS USED
In computer science:
Shopping statistics
Voice, character, and image recognition patterns and machine learning groups
In statistics:
Astronomy
Biology, multivariate statistical estimation, and pattern recognition problems
In clustering, similarities are revealed by using various distance measures.
For the accuracy of the results, the data must be pre-processed in order to minimize errors.
• In order for a clustering algorithm to group data, it must know what makes a pair of samples similar.
Distance measures:
Euclidean distance
Minkowski distance
Manhattan distance
Hilbert distance
DISTANCE
Consider the records in a database that we will denote as D:
D = {X1, X2, X3, ..., Xn}
For example, for Xm = {2, 4, 7} and Xj = {1, 6, 5}, the Euclidean distance is
d(Xm, Xj) = sqrt( Σ_{i=1}^{n} (x_mi − x_ji)² ) = sqrt( 1² + (−2)² + 2² ) = sqrt(9) = 3
EXAMPLE
Xm = {6, 5, 5}
Xj = {5, 4, 1}
Xk = {5, 3, 4}
Calculate whether point Xm is closer to Xj or to Xk using the Euclidean distance.
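The example can be checked with a few lines of Python implementing the Euclidean distance formula directly:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

Xm, Xj, Xk = (6, 5, 5), (5, 4, 1), (5, 3, 4)
d_j = euclidean(Xm, Xj)  # sqrt(1 + 1 + 16) = sqrt(18), about 4.243
d_k = euclidean(Xm, Xk)  # sqrt(1 + 4 + 1)  = sqrt(6),  about 2.449
print("Xm is closer to", "Xj" if d_j < d_k else "Xk")  # Xm is closer to Xk
```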
Xm = {1, 3, 2, 4, 5, 7}
Xj = {0, 2, 2, 4, 4, 5}
Sim(Xm, Xj)_COSINE = ( Σ_{i=1}^{n} x_mi · x_ji ) / ( sqrt(Σ_{i=1}^{n} x_mi²) · sqrt(Σ_{i=1}^{n} x_ji²) )
= 81 / ( sqrt(104) · sqrt(65) ) = 81 / 82.22 = 0.985
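This result can be verified directly; the code below reproduces 81 / (sqrt(104) · sqrt(65)) ≈ 0.985:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product over the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

Xm = [1, 3, 2, 4, 5, 7]
Xj = [0, 2, 2, 4, 4, 5]
print(round(cosine_similarity(Xm, Xj), 3))  # 0.985
```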
EXAMPLE
• When a data point is assigned to cluster K as a result of the calculations, how much will the distance or similarity of cluster K to the other clusters change?
• Has cluster K reached a sufficient size?
• If cluster K is divided into subclusters, will these new subclusters really be different from each other?
For a cluster with N members x_mi, the following statistics are defined:
Centroid: X0 = ( Σ_{i=1}^{N} x_mi ) / N
Radius: R = sqrt( Σ_{i=1}^{N} (x_mi − X0)² / N )
Diameter: D = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (x_mi − x_mj)² / (N(N − 1)) )
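The three statistics, sketched for a one-dimensional cluster (the sample values are hypothetical):

```python
import math

def centroid(xs):
    """X0: the mean of the cluster members."""
    return sum(xs) / len(xs)

def radius(xs):
    """R: root mean squared distance of members from the centroid."""
    x0 = centroid(xs)
    return math.sqrt(sum((x - x0) ** 2 for x in xs) / len(xs))

def diameter(xs):
    """D: root mean squared pairwise distance between members."""
    n = len(xs)
    total = sum((xi - xj) ** 2 for xi in xs for xj in xs)
    return math.sqrt(total / (n * (n - 1)))

xs = [2, 4, 4, 6]    # hypothetical one-dimensional cluster
print(centroid(xs))  # 4.0
```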
DISTANCE CALCULATION METHODS BETWEEN CLUSTERS
The Euclidean distance between two clusters (K1 and K2) with centers X01 and X02
can be calculated as follows.
Distance(K1, K2) = sqrt( (X01 − X02)² )
For the example below, the center of K1 is (3, 4.2) and the center of K2 is (5.4, 7.4), so:
Center_dist(K1, K2) = sqrt( (3 − 5.4)² + (4.2 − 7.4)² ) = sqrt(16) = 4
EXAMPLE
K1 CLUSTER
ID   A VALUE   B VALUE
1    3         4
2    5         5
3    4         4
4    3         5
5    2         4
6    3         6
7    2         4
8    3         4
9    3         3
10   2         3
CENTER   3      4.2
RADIUS   0.8    0.76

K2 CLUSTER
ID   A VALUE   B VALUE
1    4         7
2    5         5
3    6         9
4    7         6
5    5         10
CENTER   5.4    7.4
RADIUS   1.04   3.44
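Recomputing the two centers from the tables and the distance between them confirms the result of 4:

```python
import math

K1 = [(3, 4), (5, 5), (4, 4), (3, 5), (2, 4),
      (3, 6), (2, 4), (3, 4), (3, 3), (2, 3)]
K2 = [(4, 7), (5, 5), (6, 9), (7, 6), (5, 10)]

def center(points):
    """Per-coordinate mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

c1, c2 = center(K1), center(K2)  # (3.0, 4.2) and (5.4, 7.4)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
print(round(dist, 6))  # 4.0
```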
https://www.youtube.com/watch?v=4b5d3muPQmA
Fuzzy C-Means (FCM) Algorithm
When hierarchical (agglomerative) or partitioning methods are used, data point X1 will be only in cluster 1 and X2 only in cluster 2. In the FCM algorithm, however, a data point can belong to both cluster 1 and cluster 2.
For example, the membership value of X1 to cluster 1 can be m11 = 0.73 and its membership value to cluster 2 m12 = 0.27. Similarly, the membership values of the X2 data point can be m22 = 0.2 to cluster 2 and m21 = 0.8 to cluster 1.
The sum of the membership values of a data point over all clusters must be "1".
The clustering process is completed when the objective function converges, i.e., improves by less than the specified minimum amount:
J_m = Σ_{i=1}^{n} Σ_{j=1}^{c} u_ij^m ||X_i − C_j||² ,  1 ≤ m < ∞
FUZZY C-MEANS (FCM) ALGORITHM
Membership update:
U_ij = 1 / Σ_{k=1}^{c} ( ||X_i − C_j|| / ||X_i − C_k|| )^(2/(m−1))
Centroid update:
C_j = ( Σ_{i=1}^{N} U_ij^m X_i ) / ( Σ_{i=1}^{N} U_ij^m )
With m = 2, the centroids in the example are c1 = 13.16 and c2 = 11.81.
Repeat the process until the change in the membership values is less than a threshold (for example, 0.01).
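The two update rules can be combined into a small FCM loop. This is only a sketch: the 1-D sample data and the random initialization are assumptions for illustration (the slides do not specify an initialization):

```python
import random

def fcm(xs, c=2, m=2.0, tol=0.01, max_iter=100, seed=0):
    """Minimal 1-D fuzzy c-means: alternate the centroid and membership updates."""
    rng = random.Random(seed)
    # Random initial membership matrix; each row is normalized to sum to 1.
    U = []
    for _ in xs:
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    centers = [0.0] * c
    for _ in range(max_iter):
        # Centroid update: C_j = sum_i(U_ij^m * x_i) / sum_i(U_ij^m)
        for j in range(c):
            den = sum(U[i][j] ** m for i in range(len(xs)))
            centers[j] = sum((U[i][j] ** m) * xs[i] for i in range(len(xs))) / den
        # Membership update: U_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        new_U = []
        for x in xs:
            d = [abs(x - cj) or 1e-12 for cj in centers]  # guard against zero distance
            new_U.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                          for j in range(c)])
        change = max(abs(new_U[i][j] - U[i][j])
                     for i in range(len(xs)) for j in range(c))
        U = new_U
        if change < tol:  # memberships have settled: stop
            break
    return centers, U

centers, U = fcm([3, 7, 10, 17, 18, 20])
```

With two well-separated groups in the data, the loop settles on one center per group and near-crisp memberships.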
FUZZY C-MEANS EXAMPLE
▪ When we take the fuzziness value m as 1, there is no fuzziness and the U matrix becomes crisp:
0 0 0 1 1 1
1 1 1 0 0 0
INDEX AND ALGORITHM FOR CALCULATING THE
OPTIMUM NUMBER OF CLUSTERS
• However, many algorithms require the user to enter the initial number of clusters.
• In this case, the user must either determine the optimum number of clusters by trial and error, or perform a number of tests after each clustering run and calculate which one yields more efficient results.
Indexes used to measure the success of clustering analyses:
• Silhouette (maximize)
• Dunn (maximize)
• Calinski-Harabasz (maximize)
• Davies-Bouldin (minimize)
• Xie-Beni (minimize)
XIE- BENI INDEX
Let's calculate the Xie-Beni index value of the data set that we have divided into two clusters.
XB = [ Σ_{i=1}^{K} Σ_{j=1}^{N} U_ij² ||x_j − m_i||² / N ] / min_{i≠j} ||m_i − m_j||²
X    MEMBERSHIP TO M1   MEMBERSHIP TO M2
3    0.9519             0.0481
7    0.9974             0.0026
10   0.842              0.158
17   0.0137             0.9863
18   0.0004             0.9996
20   0.0164             0.9836
Denominator calculation:
min_{i≠j} ||m_i − m_j||² = (6.4287 − 18.2453)² = 139.6320356
In this case,
XB = [ Σ_{i=1}^{K} Σ_{j=1}^{N} U_ij² ||x_j − m_i||² / N ] / min_{i≠j} ||m_i − m_j||² = 7.254129867 / 139.6320356 = 0.051951759
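The formula can be implemented directly as a sketch. The inputs reuse the example's data, centers, and membership matrix; since these are rounded values rather than the slide's exact intermediates, the compactness term the function computes may differ slightly from the worked numbers. The printed value re-checks the denominator:

```python
def xie_beni(xs, centers, U):
    """Xie-Beni index: membership-weighted compactness over the minimum
    squared separation between cluster centers (smaller is better)."""
    N = len(xs)
    compactness = sum(U[j][i] ** 2 * (xs[j] - centers[i]) ** 2
                      for j in range(N) for i in range(len(centers))) / N
    separation = min((centers[i] - centers[k]) ** 2
                     for i in range(len(centers))
                     for k in range(len(centers)) if i != k)
    return compactness / separation

xs = [3, 7, 10, 17, 18, 20]
centers = [6.4287, 18.2453]  # cluster centers from the example
U = [[0.9519, 0.0481], [0.9974, 0.0026], [0.842, 0.158],
     [0.0137, 0.9863], [0.0004, 0.9996], [0.0164, 0.9836]]
print(round((centers[0] - centers[1]) ** 2, 3))  # 139.632
```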
PARTITION COEFFICIENT
This index calculates a validity value using only the membership matrix:
PC = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{n} U_ij²
U matrix:
Cluster 1   Cluster 2            u² Cluster 1   u² Cluster 2
0.9519      0.0481     ------>   0.90611361     0.00231361
0.9974      0.0026     ------>   0.99480676     0.00000676
0.842       0.158      ------>   0.708964       0.024964
0.0137      0.9863     ------>   0.00018769     0.97278769
0.0004      0.9996     ------>   0.00000016     0.99920016
0.0164      0.9836     ------>   0.00026896     0.96746896
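Summing the squared memberships and dividing by N gives the PC value for this matrix (values close to 1 indicate near-crisp clustering):

```python
U = [[0.9519, 0.0481], [0.9974, 0.0026], [0.842, 0.158],
     [0.0137, 0.9863], [0.0004, 0.9996], [0.0164, 0.9836]]

# PC = (1/N) * sum of all squared membership values
pc = sum(u ** 2 for row in U for u in row) / len(U)
print(round(pc, 4))  # 0.9295
```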
▪ The Silhouette validity index calculates the Silhouette width for each sample, the average Silhouette for each cluster, and the average Silhouette width for the entire data set.
▪ Like the other validity indices, it aims to capture situations where homogeneity within each cluster is high and the clusters are well separated from each other.
S(i) = (b(i) − a(i)) / max{a(i), b(i)}
SILHOUETTE VALIDATION METHOD
According to the condition T ≤ 16, the data set is divided into two clusters: A = {1, 2, 3} and B = {4, 5, 6}. Let us evaluate cluster A with the Silhouette validation method.
EXAMPLE
S(Record 1) = (b(1) − a(1)) / max{a(1), b(1)} = (89.8 − 11.45) / 89.8 = 0.872
S(Record 2) = (b(2) − a(2)) / max{a(2), b(2)} = (98.4 − 18.55) / 98.4 = 0.812
S(Record 3) = (b(3) − a(3)) / max{a(3), b(3)} = (84.2 − 14.9) / 84.2 = 0.823
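The three record values can be recomputed from their a(i) and b(i) pairs; the results agree with the example up to rounding:

```python
def silhouette(a, b):
    """Silhouette width for one sample: (b - a) / max(a, b)."""
    return (b - a) / max(a, b)

# (a(i), b(i)) pairs for records 1-3 from the example
pairs = [(11.45, 89.8), (18.55, 98.4), (14.9, 84.2)]
for i, (a, b) in enumerate(pairs, start=1):
    print(f"S(record {i}) = {silhouette(a, b)}")
```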