Week 9 – Clustering
PROF. DR. GÖKHAN SILAHTAROĞLU
Lecturer: NADA MISK
MODEL VALIDATION
Holdout Method
Model validation means calculating, after the learning process is complete and predictions have been made on the test data, how reliable those predictions are.
Error = 1 − Accuracy
Recall = tp / (tp + fn)
Precision = tp / (tp + fp)
Specificity = tn / (tn + fp)
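These metrics can be computed directly from the four confusion-matrix counts. A minimal sketch (the counts below are made-up illustration values, not from the lecture):

```python
def metrics(tp, fp, tn, fn):
    """Validation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,          # Error = 1 - Accuracy
        "recall": tp / (tp + fn),       # true positive rate
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical counts for illustration
m = metrics(tp=40, fp=10, tn=45, fn=5)
print(m["accuracy"])  # 0.85
```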
EXAMPLE
https://www.youtube.com/watch?v=4jRBRDbJemM
EVALUATION OF PREDICTIVE SUCCESS OF MODELS
VARIANCE – BIAS
To reduce underfitting;
• Increase the model complexity.
• Clean the noise (outliers, meaningless data, anything that confuses the model) from the data.
• Increase the number of epochs.
• Increase the training time.
To reduce overfitting;
• Increase the training data.
• Reduce the model complexity.
• Use early stopping during the training phase.
https://www.youtube.com/watch?v=EuBBz3bI-aA
https://www.youtube.com/watch?v=ms-Ooh9mjiE
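One common way to realize the early-stopping advice is a training loop that halts once validation loss stops improving for a few epochs. The `train_step` callback and the toy loss curve below are invented for illustration:

```python
def early_stop_train(train_step, patience=3, max_epochs=100):
    """Stop training when validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        val_loss = train_step(epoch)  # runs one epoch, returns validation loss
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss plateaued: stop early
    return best_loss, epoch

# Toy loss curve: improves, then starts overfitting (validation loss rises)
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
best, stopped_at = early_stop_train(lambda e: losses[e], patience=3)
print(best, stopped_at)  # 0.5 6
```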
CLUSTERING
Parameters used:
Distance and Similarity Measures.
Cat Dog Example
There is no concept of class!
Areas of use:
Customer segmentation
DNA identification
Fingerprint, face, or voice recognition
Product recommendation
Customer segmentation examples:
How often and how much a customer purchases
Which products were viewed, and how many times
How long the customer spent reviewing a product
CLUSTERING EXAMPLE
CLUSTER ANALYSIS
Clustering is the process of dividing data into clusters according to their similarities: similarity within a cluster is high, and dissimilarity between clusters is high.
AREAS WHERE CLUSTERING IS USED
In computer science:
Shopping statistics
Voice, character, and image recognition patterns and machine learning groups
In statistics:
Astronomy
Biology, multivariate statistical estimation, and pattern recognition problems
In clustering, similarities are revealed by using various distance measures.
For the accuracy of the results, the data must be pre-processed in order to minimize errors.
• In order for a clustering algorithm to group data, it must know what makes a pair of samples similar.
Distance measures:
Euclidean distance
Minkowski distance
Manhattan distance
Hilbert distance
DISTANCE
Consider the records in a database that we will denote as D:
D = {X1, X2, X3, ..., Xn}
For example, for Xm = {2, 4, 7} and Xj = {1, 6, 5}, the Euclidean distance is
d(Xm, Xj) = sqrt( Σ_{i=1}^{n} (x_mi − x_ji)² ) = sqrt( 1² + (−2)² + 2² ) = sqrt(9) = 3
EXAMPLE
Xm = {6, 5, 5}
Xj = {5, 4, 1}
Xk = {5, 3, 4}
Calculate whether point Xm is closer to Xj or to Xk using the Euclidean distance.
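The example can be checked with a few lines of Python implementing the Euclidean distance formula directly:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

Xm, Xj, Xk = (6, 5, 5), (5, 4, 1), (5, 3, 4)
d_j = euclidean(Xm, Xj)  # sqrt(1 + 1 + 16) = sqrt(18), about 4.243
d_k = euclidean(Xm, Xk)  # sqrt(1 + 4 + 1)  = sqrt(6),  about 2.449
print("Xm is closer to", "Xj" if d_j < d_k else "Xk")  # Xm is closer to Xk
```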
Xm = {1, 3, 2, 4, 5, 7}
Xj = {0, 2, 2, 4, 4, 5}
Sim(Xm, Xj)_COSINE = ( Σ_{i=1}^{n} x_mi · x_ji ) / ( sqrt(Σ_{i=1}^{n} x_mi²) · sqrt(Σ_{i=1}^{n} x_ji²) )
= 81 / ( sqrt(104) · sqrt(65) ) = 81 / 82.22 = 0.985
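This result can be verified directly; the code below reproduces 81 / (sqrt(104) · sqrt(65)) ≈ 0.985:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product over the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

Xm = [1, 3, 2, 4, 5, 7]
Xj = [0, 2, 2, 4, 4, 5]
print(round(cosine_similarity(Xm, Xj), 3))  # 0.985
```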
EXAMPLE
• When a data point is assigned to cluster K as a result of the calculations, how much will the distance or similarity of cluster K to the other clusters change?
• Has cluster K reached a sufficient size?
• If cluster K is divided into subclusters, will these new subclusters really be different from each other?
For a cluster with N members x_mi, the following statistics are defined:
Centroid: X0 = ( Σ_{i=1}^{N} x_mi ) / N
Radius: R = sqrt( Σ_{i=1}^{N} (x_mi − X0)² / N )
Diameter: D = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (x_mi − x_mj)² / (N(N − 1)) )
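The three statistics, sketched for a one-dimensional cluster (the sample values are hypothetical):

```python
import math

def centroid(xs):
    """X0: the mean of the cluster members."""
    return sum(xs) / len(xs)

def radius(xs):
    """R: root mean squared distance of members from the centroid."""
    x0 = centroid(xs)
    return math.sqrt(sum((x - x0) ** 2 for x in xs) / len(xs))

def diameter(xs):
    """D: root mean squared pairwise distance between members."""
    n = len(xs)
    total = sum((xi - xj) ** 2 for xi in xs for xj in xs)
    return math.sqrt(total / (n * (n - 1)))

xs = [2, 4, 4, 6]    # hypothetical one-dimensional cluster
print(centroid(xs))  # 4.0
```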
DISTANCE CALCULATION METHODS BETWEEN CLUSTERS
The Euclidean distance between two clusters (K1 and K2) with centers X01 and X02
can be calculated as follows.
Distance(K1, K2) = sqrt( (X01 − X02)² )
For the example below, the center of K1 is (3, 4.2) and the center of K2 is (5.4, 7.4), so:
Center_dist(K1, K2) = sqrt( (3 − 5.4)² + (4.2 − 7.4)² ) = sqrt(16) = 4
EXAMPLE
K1 CLUSTER
ID   A VALUE   B VALUE
1    3         4
2    5         5
3    4         4
4    3         5
5    2         4
6    3         6
7    2         4
8    3         4
9    3         3
10   2         3
CENTER   3      4.2
RADIUS   0.8    0.76

K2 CLUSTER
ID   A VALUE   B VALUE
1    4         7
2    5         5
3    6         9
4    7         6
5    5         10
CENTER   5.4    7.4
RADIUS   1.04   3.44
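Recomputing the two centers from the tables and the distance between them confirms the result of 4:

```python
import math

K1 = [(3, 4), (5, 5), (4, 4), (3, 5), (2, 4),
      (3, 6), (2, 4), (3, 4), (3, 3), (2, 3)]
K2 = [(4, 7), (5, 5), (6, 9), (7, 6), (5, 10)]

def center(points):
    """Per-coordinate mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

c1, c2 = center(K1), center(K2)  # (3.0, 4.2) and (5.4, 7.4)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
print(round(dist, 6))  # 4.0
```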
https://www.youtube.com/watch?v=4b5d3muPQmA
Fuzzy C-Means (FCM) Algorithm
When hierarchical (agglomerative) or partitioning methods are used, data point X1 will be only in cluster 1 and X2 only in cluster 2. In the FCM algorithm, however, a data point can belong to both cluster 1 and cluster 2.
For example, the membership value of X1 to cluster 1 can be m11 = 0.73 and its membership value to cluster 2 m12 = 0.27. Similarly, the membership values of the X2 data point can be m22 = 0.2 to cluster 2 and m21 = 0.8 to cluster 1.
The sum of the membership values of a data point over all clusters must be "1".
The clustering process is completed when the objective function converges, i.e., improves by less than the specified minimum amount:
J_m = Σ_{i=1}^{n} Σ_{j=1}^{c} u_ij^m ||X_i − C_j||² ,  1 ≤ m < ∞
FUZZY C-MEANS (FCM) ALGORITHM
Membership update:
U_ij = 1 / Σ_{k=1}^{c} ( ||X_i − C_j|| / ||X_i − C_k|| )^(2/(m−1))
Centroid update:
C_j = ( Σ_{i=1}^{N} U_ij^m X_i ) / ( Σ_{i=1}^{N} U_ij^m )
With m = 2, the centroids in the example are c1 = 13.16 and c2 = 11.81.
Repeat the process until the change in the membership values is less than a threshold (for example, 0.01).
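The two update rules can be combined into a small FCM loop. This is only a sketch: the 1-D sample data and the random initialization are assumptions for illustration (the slides do not specify an initialization):

```python
import random

def fcm(xs, c=2, m=2.0, tol=0.01, max_iter=100, seed=0):
    """Minimal 1-D fuzzy c-means: alternate the centroid and membership updates."""
    rng = random.Random(seed)
    # Random initial membership matrix; each row is normalized to sum to 1.
    U = []
    for _ in xs:
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        U.append([u / s for u in row])
    centers = [0.0] * c
    for _ in range(max_iter):
        # Centroid update: C_j = sum_i(U_ij^m * x_i) / sum_i(U_ij^m)
        for j in range(c):
            den = sum(U[i][j] ** m for i in range(len(xs)))
            centers[j] = sum((U[i][j] ** m) * xs[i] for i in range(len(xs))) / den
        # Membership update: U_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        new_U = []
        for x in xs:
            d = [abs(x - cj) or 1e-12 for cj in centers]  # guard against zero distance
            new_U.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                          for j in range(c)])
        change = max(abs(new_U[i][j] - U[i][j])
                     for i in range(len(xs)) for j in range(c))
        U = new_U
        if change < tol:  # memberships have settled: stop
            break
    return centers, U

centers, U = fcm([3, 7, 10, 17, 18, 20])
```

With two well-separated groups in the data, the loop settles on one center per group and near-crisp memberships.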
FUZZY C-MEANS EXAMPLE
▪ When we take the fuzziness value m as 1, there is no fuzziness and the U matrix becomes crisp:
0 0 0 1 1 1
1 1 1 0 0 0
INDEX AND ALGORITHM FOR CALCULATING THE
OPTIMUM NUMBER OF CLUSTERS
• However, many algorithms require the user to enter the initial number of clusters.
• In this case, the user must either determine the optimum number of clusters by trial and error, or perform a number of tests after each clustering run and calculate which one yields more efficient results.
Indexes used to measure the success of clustering analyses:
• Silhouette (maximize)
• Dunn (maximize)
• Calinski-Harabasz (maximize)
• Davies-Bouldin (minimize)
• Xie-Beni (minimize)
XIE- BENI INDEX
Let's calculate the Xie-Beni index value of the data set that we have divided into two clusters.
XB = [ Σ_{i=1}^{K} Σ_{j=1}^{N} U_ij² ||x_j − m_i||² / N ] / min_{i≠j} ||m_i − m_j||²
X    MEMBERSHIP TO M1   MEMBERSHIP TO M2
3    0.9519             0.0481
7    0.9974             0.0026
10   0.842              0.158
17   0.0137             0.9863
18   0.0004             0.9996
20   0.0164             0.9836
Denominator calculation:
min_{i≠j} ||m_i − m_j||² = (6.4287 − 18.2453)² = 139.6320356
In this case,
XB = [ Σ_{i=1}^{K} Σ_{j=1}^{N} U_ij² ||x_j − m_i||² / N ] / min_{i≠j} ||m_i − m_j||² = 7.254129867 / 139.6320356 = 0.051951759
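The formula can be implemented directly as a sketch. The inputs reuse the example's data, centers, and membership matrix; since these are rounded values rather than the slide's exact intermediates, the compactness term the function computes may differ slightly from the worked numbers. The printed value re-checks the denominator:

```python
def xie_beni(xs, centers, U):
    """Xie-Beni index: membership-weighted compactness over the minimum
    squared separation between cluster centers (smaller is better)."""
    N = len(xs)
    compactness = sum(U[j][i] ** 2 * (xs[j] - centers[i]) ** 2
                      for j in range(N) for i in range(len(centers))) / N
    separation = min((centers[i] - centers[k]) ** 2
                     for i in range(len(centers))
                     for k in range(len(centers)) if i != k)
    return compactness / separation

xs = [3, 7, 10, 17, 18, 20]
centers = [6.4287, 18.2453]  # cluster centers from the example
U = [[0.9519, 0.0481], [0.9974, 0.0026], [0.842, 0.158],
     [0.0137, 0.9863], [0.0004, 0.9996], [0.0164, 0.9836]]
print(round((centers[0] - centers[1]) ** 2, 3))  # 139.632
```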
PARTITION COEFFICIENT
This index calculates a validity value using only the membership matrix:
PC = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{n} U_ij²
U matrix:
Cluster 1   Cluster 2            u² Cluster 1   u² Cluster 2
0.9519      0.0481     ------>   0.90611361     0.00231361
0.9974      0.0026     ------>   0.99480676     0.00000676
0.842       0.158      ------>   0.708964       0.024964
0.0137      0.9863     ------>   0.00018769     0.97278769
0.0004      0.9996     ------>   0.00000016     0.99920016
0.0164      0.9836     ------>   0.00026896     0.96746896
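Summing the squared memberships and dividing by N gives the PC value for this matrix (values close to 1 indicate near-crisp clustering):

```python
U = [[0.9519, 0.0481], [0.9974, 0.0026], [0.842, 0.158],
     [0.0137, 0.9863], [0.0004, 0.9996], [0.0164, 0.9836]]

# PC = (1/N) * sum of all squared membership values
pc = sum(u ** 2 for row in U for u in row) / len(U)
print(round(pc, 4))  # 0.9295
```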
▪ The Silhouette validity index calculates the Silhouette width for each sample, the average Silhouette for each cluster, and the average Silhouette width for the entire data set.
▪ Like the other validity indices, it aims to capture situations where homogeneity within each cluster is high and the clusters are well separated from each other.
S(i) = (b(i) − a(i)) / max{a(i), b(i)}
SILHOUETTE VALIDATION METHOD
According to the condition T ≤ 16, the data set is divided into two clusters: A = {1, 2, 3} and B = {4, 5, 6}. Let us evaluate cluster A with the Silhouette validation method.
EXAMPLE
S(Record 1) = (b(1) − a(1)) / max{a(1), b(1)} = (89.8 − 11.45) / 89.8 = 0.872
S(Record 2) = (b(2) − a(2)) / max{a(2), b(2)} = (98.4 − 18.55) / 98.4 = 0.812
S(Record 3) = (b(3) − a(3)) / max{a(3), b(3)} = (84.2 − 14.9) / 84.2 = 0.823
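The three record values can be recomputed from their a(i) and b(i) pairs; the results agree with the example up to rounding:

```python
def silhouette(a, b):
    """Silhouette width for one sample: (b - a) / max(a, b)."""
    return (b - a) / max(a, b)

# (a(i), b(i)) pairs for records 1-3 from the example
pairs = [(11.45, 89.8), (18.55, 98.4), (14.9, 84.2)]
for i, (a, b) in enumerate(pairs, start=1):
    print(f"S(record {i}) = {silhouette(a, b)}")
```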