Clustering

Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
• For clustering, how do we evaluate the “goodness” of the resulting clusters?

Cluster Validation

• Cluster validation
  – Assess the quality and reliability of clustering results.
• Why validation?
  – To avoid finding clusters formed by chance
  – To compare clustering algorithms
  – To choose clustering parameters
    • e.g., the number of clusters in the K-means algorithm

[Figure: 2 × 2 grid of scatter plots, axes x and y from 0 to 1; surviving panel titles: DBSCAN, Complete Link]

Aspects of Cluster Validation

• Comparing the clustering results to ground truth (externally known results).
  – External Index
• Evaluating the quality of clusters without reference to external information.
  – Use only the data
  – Internal Index
• Determining the reliability of clusters.
  – To what confidence level the clusters are not formed by chance
  – Statistical framework

Cluster validation process

• Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion
  – How to be “quantitative”: To employ the measures.
  – How to be “objective”: To validate the measures!

[Diagram: INPUT: DataSet(X) → Clustering Algorithm → Partitions P, Codebook C → Validity Index → m*; the algorithm is run for different numbers of clusters m]

Internal Index:
• Validate without external info
• Solve the number of clusters

External Index
• Validate against ground truth
• Compare two clusters (how similar)

Internal indexes

• Choose the number of clusters that minimizes (or maximizes) an internal index:
  – Rule of thumb: one simple rule of thumb sets the number to $k \approx \sqrt{n/2}$, with n as the number of objects (data points).
  – Variances of within cluster and between clusters
  – Rate-distortion method
  – F-ratio
  – Davies-Bouldin index (DBI)
  – Bayesian Information Criterion (BIC)
  – Silhouette Coefficient

Mean square error (MSE)

• The more clusters, the smaller the MSE.
• Small knee-point near the correct value.
• But how to detect it?

[Figure: MSE versus number of clusters (5 to 25) for data set S2; knee-point between 14 and 15 clusters]
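The slide leaves the detection question open. One common heuristic, which is an assumption here and not a method given in the slides, is to take the point where the MSE curve bends most sharply (largest second difference). The sketch below also shows the rule of thumb in its commonly quoted form $k \approx \sqrt{n/2}$. The names `rule_of_thumb_k` and `knee_point` and the sample MSE values are hypothetical.

```python
import math


def rule_of_thumb_k(n):
    """Rule-of-thumb cluster count, k ~ sqrt(n / 2), for n data points."""
    return max(1, round(math.sqrt(n / 2)))


def knee_point(ks, mse):
    """Return the k with the largest second difference of the MSE curve.

    ks must be consecutive integers; mse[i] is the MSE obtained with ks[i] clusters.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop in MSE before k minus drop after k: large where the curve flattens.
        bend = (mse[i - 1] - mse[i]) - (mse[i] - mse[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical MSE values for k = 1..6; the curve flattens after k = 3.
ks = [1, 2, 3, 4, 5, 6]
mse = [10.0, 6.0, 2.0, 1.7, 1.5, 1.4]
print(knee_point(ks, mse))     # 3
print(rule_of_thumb_k(200))    # 10
```

A second-difference test is sensitive to noise in the MSE curve, so on real data the curve is usually averaged over several runs before looking for the knee.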

• Minimize within cluster variance (MSE)
• Maximize between cluster variance

[Figure labels: intra-cluster variance is minimized; inter-cluster variance is maximized]

• $SSW / k$ ---- Ball and Hall (1965)
• $k^2 |W|$ ---- Marriot (1971)
• $\dfrac{SSB / (k - 1)}{SSW / (N - k)}$ ---- Calinski & Harabasz (1974)
• $\log(SSB / SSW)$ ---- Hartigan (1975)
• $d \log\left(\sqrt{SSW / (d N^2)}\right) + \log(k)$ ---- Xu (1997)

(d is the dimension of data; N is the size of data; k is the number of clusters)

SSW = Sum of squares within the clusters
SSB = Sum of squares between the clusters
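As a rough illustration, four of the indexes listed above can be computed directly from SSW and SSB. This is a sketch under stated assumptions: the function name `ssq_indexes` is hypothetical, logarithms are natural (the slides do not state a base), Xu's index is taken with a square root inside the logarithm and a plus sign between the two terms, and Marriot's index is left out because it needs the determinant of the within-cluster scatter matrix W.

```python
import math


def ssq_indexes(ssw, ssb, n, k, d):
    """Sum-of-squares validity indexes.

    ssw, ssb: within- and between-cluster sums of squares
    n: number of data points, k: number of clusters (k >= 2), d: data dimension
    """
    return {
        "ball_hall": ssw / k,
        "calinski_harabasz": (ssb / (k - 1)) / (ssw / (n - k)),
        "hartigan": math.log(ssb / ssw),
        # Assumed form of Xu's index: d * log(sqrt(SSW / (d * N^2))) + log(k)
        "xu": d * math.log(math.sqrt(ssw / (d * n ** 2))) + math.log(k),
    }


# Points 1, 2, 4, 5 split into {1, 2} and {4, 5}: SSW = 1, SSB = 9.
print(ssq_indexes(ssw=1.0, ssb=9.0, n=4, k=2, d=1))
```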

Internal Measures: Cohesion and Separation

• Cluster Cohesion: Measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
  – Cohesion is measured by the within cluster sum of squares (SSE)
    $SSW = \sum_i \sum_{x \in C_i} (x - m_i)^2$
  – Separation is measured by the between cluster sum of squares
    $SSB = \sum_i |C_i| (m - m_i)^2$
  – Where |Ci| is the size of cluster i

• Example: SSE
  – BSS + WSS = constant

[Figure: number line with points 1, 2, 4, 5; overall mean m = 3; cluster means m1 = 1.5 and m2 = 4.5]

K=1 cluster:
  $WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$
  $BSS = 4 \times (3 - 3)^2 = 0$
  Total = 10 + 0 = 10

K=2 clusters:
  $WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$
  $BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$
  Total = 1 + 9 = 10

Total variance of X = SSW + SSB
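The worked example with the points 1, 2, 4, 5 can be checked with a few lines of code. This is a minimal sketch for one-dimensional data; the function name `wss_bss` is hypothetical.

```python
def wss_bss(clusters):
    """Return (WSS, BSS) for a list of 1-D clusters (lists of numbers)."""
    points = [x for c in clusters for x in c]
    m = sum(points) / len(points)                # overall mean m
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)                     # cluster mean m_i
        wss += sum((x - mi) ** 2 for x in c)     # cohesion term
        bss += len(c) * (m - mi) ** 2            # separation term, weighted by |C_i|
    return wss, bss


print(wss_bss([[1, 2, 4, 5]]))     # (10.0, 0.0)
print(wss_bss([[1, 2], [4, 5]]))   # (1.0, 9.0)
```

In both cases WSS + BSS is 10, which is the constant total referred to in the example.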

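• Variance-ratio F-test: measures the ratio of the between-groups variance against the within-groups variance (original F-test)
• F-ratio (WB-index):

$$F = \frac{k \cdot SSW}{SSB} = \frac{k \sum_{i=1}^{N} \| x_i - c_{p(i)} \|^2}{\sum_{j=1}^{k} n_j \| c_j - \bar{x} \|^2}$$

where SSB equals the total variance of X minus SSW.

[Figure: F-ratio (×10^5) versus number of clusters (25 down to 5), curves labeled PNN and IS, with the minimum marked]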
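The F-ratio (WB-index) can be sketched for d-dimensional points as follows. The names `sq_dist` and `f_ratio` are hypothetical, and `labels[i]` plays the role of p(i), the index of the centroid assigned to point i. Smaller values are better.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def f_ratio(points, labels, centroids):
    """WB-index F = k * SSW / SSB."""
    n, k, d = len(points), len(centroids), len(points[0])
    mean = tuple(sum(p[t] for p in points) / n for t in range(d))   # data mean
    ssw = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    sizes = [labels.count(j) for j in range(k)]                     # n_j
    ssb = sum(sizes[j] * sq_dist(centroids[j], mean) for j in range(k))
    return k * ssw / ssb


points = [(1.0,), (2.0,), (4.0,), (5.0,)]
labels = [0, 0, 1, 1]
centroids = [(1.5,), (4.5,)]
print(f_ratio(points, labels, centroids))   # 2 * 1 / 9, about 0.222
```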

Davies-Bouldin index (DBI)

• Minimize the intra-cluster variance
• Maximize the distance between clusters
• Cost function: weighted sum of the two:

$$R_{j,k} = \frac{MAE_j + MAE_k}{d(c_j, c_k)}$$

$$DBI = \frac{1}{M} \sum_{j=1}^{M} \max_{k \neq j} R_{j,k}$$
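A minimal sketch of the DBI computation follows, assuming that MAE_j is the mean Euclidean distance from the points of cluster j to its centroid c_j and that d(c_j, c_k) is the Euclidean distance between centroids. The function name `davies_bouldin` is hypothetical, at least two clusters are required, and `math.dist` needs Python 3.8 or later. Lower values indicate better clusterings.

```python
import math


def davies_bouldin(clusters):
    """DBI for a list of clusters, each a list of d-dimensional points (tuples)."""
    def centroid(c):
        return tuple(sum(col) / len(c) for col in zip(*c))

    cents = [centroid(c) for c in clusters]
    # MAE_j: mean distance from the points of cluster j to its centroid (assumption).
    mae = [sum(math.dist(p, cj) for p in c) / len(c)
           for c, cj in zip(clusters, cents)]
    m = len(clusters)
    total = 0.0
    for j in range(m):
        # Worst-case (largest) ratio R_{j,k} over all other clusters k.
        total += max((mae[j] + mae[k]) / math.dist(cents[j], cents[k])
                     for k in range(m) if k != j)
    return total / m


good = [[(1.0,), (2.0,)], [(4.0,), (5.0,)]]
poor = [[(1.0,), (2.0,), (4.0,)], [(5.0,)]]
print(davies_bouldin(good), davies_bouldin(poor))
```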

Silhouette coefficient [Kaufman&Rousseeuw, 1990]

We need a quantitative method to assess the quality of a clustering.

The silhouette value of a point is a measure of how similar a point is to points in its own cluster compared to points in other clusters.

• Cohesion: measures how closely related the objects in a cluster are
• Separation: measures how distinct or well-separated a cluster is from other clusters

Formal definition:
• a(i) is the average distance of the point i to the other points in its own cluster A
• d(i, C) is the average distance of the point i to the points in another cluster C
• b(i) is the minimal d(i, C) over all clusters other than A

Silhouette coefficient

• Cohesion a(x): average distance of x to all other vectors in the same cluster.
• Separation b(x): average distance of x to the vectors in other clusters. Find the minimum among the clusters.
• Silhouette s(x):

$$s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$$

• $s(x) \in [-1, +1]$: -1 = bad, 0 = indifferent, 1 = good
• Silhouette coefficient (SC):

$$SC = \frac{1}{N} \sum_{i=1}^{N} s(x_i)$$

[Figure: point x with a(x), the average distance within its own cluster (cohesion), and b(x), the average distances to the other clusters, of which the minimum is taken (separation)]
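The definitions of a(x), b(x), s(x) and SC translate almost directly into code. The sketch below is for illustration only: the function name `silhouette_coefficient` is hypothetical, the default distance is the absolute difference of one-dimensional points, at least two clusters are required, and points in singleton clusters are given s(x) = 0, which is a common convention that the slides do not state.

```python
def silhouette_coefficient(clusters, dist=lambda a, b: abs(a - b)):
    """Mean silhouette value over all points; clusters is a list of lists of points."""
    values = []
    for ci, cluster in enumerate(clusters):
        for idx, x in enumerate(cluster):
            others = cluster[:idx] + cluster[idx + 1:]
            if not others:
                # Assumption: a point alone in its cluster gets s(x) = 0.
                values.append(0.0)
                continue
            # a(x): average distance to the other points of the same cluster.
            a = sum(dist(x, y) for y in others) / len(others)
            # b(x): smallest average distance to the points of any other cluster.
            b = min(sum(dist(x, y) for y in c) / len(c)
                    for cj, c in enumerate(clusters) if cj != ci)
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
    return sum(values) / len(values)


print(silhouette_coefficient([[1, 2], [4, 5]]))     # 23/35, about 0.657
print(silhouette_coefficient([[1, 2, 4], [5]]))     # about 0.1
```

The natural split {1, 2}, {4, 5} scores clearly higher than the split {1, 2, 4}, {5}, in which the point 4 has a negative silhouette value.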

[Figure-only slides; surviving titles: Silhouette coefficient, Internal indexes, Soft partitions, K-means]
