
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 10 —

Slides adapted from:


Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

1
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods

■ Hierarchical Methods

■ Density-Based Methods

■ Evaluation of Clustering

■ Summary

2
What is Cluster Analysis?
■ Cluster: A collection of data objects
■ similar (or related) to one another within the same group

■ dissimilar (or unrelated) to the objects in other groups

■ Cluster analysis (or clustering, data segmentation, …)


■ Finding similarities between data according to the characteristics found
in the data and grouping similar data objects into clusters
■ Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
■ Typical applications
■ As a stand-alone tool to get insight into data distribution

■ As a preprocessing step for other algorithms

3
Clustering for Data Understanding and
Applications
■ Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
■ Information retrieval: document clustering
■ Land use: Identification of areas of similar land use in an earth
observation database
■ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
■ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
■ Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
■ Climate: understanding Earth's climate, finding patterns in atmospheric
and ocean data
■ Economic Science: market research

4
Quality: What Is Good Clustering?

■ A good clustering method will produce high-quality clusters with
■ high intra-class similarity: cohesive within clusters
■ low inter-class similarity: distinctive between clusters
■ The quality of a clustering method depends on
■ the similarity measure used by the method
■ its implementation, and
■ its ability to discover some or all of the hidden patterns

5
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods

■ Hierarchical Methods

■ Density-Based Methods

■ Grid-Based Methods

■ Evaluation of Clustering

■ Summary

6
Partitioning Algorithms: Basic Concept

■ Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci):

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)²

■ Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
■ Global optimal: exhaustively enumerate all partitions
■ Heuristic methods: k-means and k-medoids algorithms
■ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
■ k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
7
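The criterion above can be made concrete with a short sketch. This is a minimal illustration, not part of the original slides: the function and variable names are assumed, and NumPy is used for the arithmetic.

```python
import numpy as np

def partitioning_cost(X, labels, centers):
    """Sum of squared distances of each object to its cluster's
    representative (the centroid for k-means, the medoid for k-medoids)."""
    cost = 0.0
    for i, c in enumerate(centers):
        members = X[labels == i]            # objects assigned to cluster i
        cost += np.sum((members - c) ** 2)  # squared distances to the representative
    return cost
```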
The K-Means Clustering Method

■ Given k, the k-means algorithm is implemented in four steps (see the
sketch after this slide):
■ Partition objects into k nonempty subsets
■ Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
■ Assign each object to the cluster with the nearest
seed point
■ Go back to Step 2; stop when the assignments no
longer change

8
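A minimal sketch of the four steps above in Python/NumPy. This is illustrative only; it assumes numeric data in an (n, d) array and that no cluster ever becomes empty, and the names are not from the original slides.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Lloyd-style k-means sketch for an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects at random as the initial seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop when the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```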
An Example of K-Means Clustering

K = 2

[Figure: starting from the initial data set, objects are arbitrarily partitioned
into k groups; the cluster centroids are updated and objects reassigned,
looping until no reassignment is needed]

■ Partition objects into k nonempty subsets
■ Repeat
■ Compute centroid (i.e., mean point) for each partition
■ Assign each object to the cluster of its nearest centroid
■ Until no change
9
Comments on the K-Means Method
■ Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
■ Comment: Often terminates at a local optimum.
■ Weakness
■ Need to specify k, the number of clusters, in advance
■ Sensitive to noisy data and outliers
■ Not suitable for discovering clusters with non-convex shapes

10
What Is the Problem of the K-Means Method?

■ The k-means algorithm is sensitive to outliers!

■ Since an object with an extremely large value may substantially
distort the distribution of the data
■ K-Medoids: Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most
centrally located object in a cluster


11
K-Medoids

12
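As a rough illustration of the k-medoids idea (a simplified alternating scheme, not the full PAM swap procedure), the sketch below assigns objects to the nearest medoid and then re-picks, within each cluster, the object with the smallest total intra-cluster distance. The names and parameters are assumptions for the example, and it assumes no cluster becomes empty.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Simplified k-medoids sketch: cluster representatives are data objects."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)      # assign to the nearest medoid
        new_medoids = medoids.copy()
        for i in range(k):
            members = np.where(labels == i)[0]
            # choose the member with the smallest total distance to its cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[i] = members[costs.argmin()]
        if set(new_medoids) == set(medoids):       # stop when the medoids are stable
            break
        medoids = new_medoids
    return labels, medoids
```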
Suggested videos
■ On k-medoids:
■ https://youtu.be/4b5d3muPQmA

■ On k-means and k-medoids:
■ https://youtu.be/GApaAnGx3Fw

■ It is important to note that k-medoids is not a single algorithm but a
family of algorithms (e.g., PAM, CLARA, CLARANS).

13
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

14
Hierarchical clustering
■ Hierarchical clustering methods organize the data objects into a
dendrogram of groups.
■ They can be classified, according to how the hierarchical
decomposition is formed, as:
■ Agglomerative: bottom-up (merging)
■ Divisive: top-down (splitting)

15
Hierarchical clustering
■ Agglomerative: starts with each object forming its own group.
■ The closest objects are merged step by step until all groups are
fused into one at the top level of the hierarchy or until a
stopping condition is reached.
■ Divisive: starts with all objects in a single group.
■ At each iteration, each group is split into smaller groups until
every object forms its own group or a stopping condition is
reached.

16
Hierarchical clustering

17
Hierarchical clustering
■ The definition of group proximity is what distinguishes the
different agglomerative hierarchical techniques:
■ MIN (single link) defines it as the distance between the two closest
points in different groups;
■ MAX (complete link) defines it as the distance between the two most
distant points in different groups;
■ Group average defines proximity as the average distance between
all pairs of points in different groups.

18
Hierarchical clustering

■ Example of an agglomerative hierarchical method:

■ Complete-Link: uses MAX as the group proximity; that is, the
proximity between two groups is defined as the largest distance
between two points in different groups. It is a method that is less
susceptible to noise and outliers.

19
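For reference, agglomerative clustering with MIN, MAX, group average, or Ward proximities can be run with SciPy. A brief sketch follows; the toy data and the number of flat clusters are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data

# Complete-Link uses MAX as the group proximity;
# method="single", "average", or "ward" selects the other variants
Z = linkage(X, method="complete")

labels = fcluster(Z, t=3, criterion="maxclust")      # cut into 3 flat clusters
# dendrogram(Z) draws the hierarchy when matplotlib is available
```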
Hierarchical clustering
■ Agglomerative hierarchical clustering with Complete-Link

■ Agglomerative hierarchical clustering with Ward

20
Suggested videos
■ On hierarchical clustering:
■ https://youtu.be/ijUMKMC4f9I
■ https://youtu.be/7xHsRkOdVwo

21
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary

22
Density-Based Clustering Methods

■ Clustering based on density (local cluster criterion), such as
density-connected points
■ Major features:
■ Discover clusters of arbitrary shape

■ Handle noise

■ One scan
■ Need density parameters as termination condition

■ Several interesting studies:


■ DBSCAN: Ester, et al. (KDD’96)

■ OPTICS: Ankerst, et al. (SIGMOD’99)

■ DENCLUE: Hinneburg & D. Keim (KDD’98)

■ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
23
Density-Based Clustering: Basic Concepts
■ Two parameters:
■ Eps: Maximum radius of the neighbourhood
■ MinPts: Minimum number of points in an
Eps-neighbourhood of that point
■ N_Eps(p): {q belongs to D | dist(p, q) ≤ Eps}
■ Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
■ p belongs to N_Eps(q)
■ core point condition: |N_Eps(q)| ≥ MinPts

[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]

24
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
■ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]

25
DBSCAN: The Algorithm
■ Arbitrarily select a point p
■ Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
■ If p is a core point, a cluster is formed
■ If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
■ Continue the process until all of the points have been
processed

26
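A small example of the algorithm in practice, using scikit-learn's DBSCAN. The data and parameter values are illustrative assumptions, not from the original slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),   # dense blob 1
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),   # dense blob 2
    rng.uniform(low=-2, high=7, size=(10, 2)),      # scattered points, likely noise
])

# eps corresponds to Eps and min_samples to MinPts
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_                    # cluster ids; -1 marks noise points
core_points = db.core_sample_indices_  # indices of the core points
```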
DBSCAN: Sensitive to Parameters

27
DBSCAN: Core, Border and Noise Points

[Figure: original points, and the point types (core, border, and noise) found
with Eps = 10, MinPts = 4]


When DBSCAN Does NOT Work Well

■ Varying densities
■ High-dimensional data

[Figures: original points, and DBSCAN results with (MinPts = 4, Eps = 9.75)
and (MinPts = 4, Eps = 9.92)]
Suggested videos
■ On DBSCAN:
■ https://youtu.be/RDZUdRSDOok

30
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods

■ Hierarchical Methods

■ Density-Based Methods

■ Grid-Based Methods

■ Evaluation of Clustering

■ Summary

31
Cluster Validity
■ For supervised classification we have a variety of
measures to evaluate how good our model is
■ Accuracy, precision, recall

■ For cluster analysis, the analogous question is: how do we evaluate
the “goodness” of the resulting clusters?

■ But “clusters are in the eye of the beholder”!

■ Then why do we want to evaluate them?


■ To avoid finding patterns in noise
■ To compare clustering algorithms
■ To compare two sets of clusters
■ To compare two clusters
Clusters found in Random Data

[Figure: “clusters” found in random points by DBSCAN, K-means, and
Complete Link]
Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known
results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data
without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to
determine which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate
the entire clustering or just individual clusters.
Measures of Cluster Validity

■ Numerical measures that are applied to judge various aspects of cluster
validity are classified into the following types:
■ External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
■ Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
Internal Measures: SSE
■ Clusters in more complicated figures aren’t well separated
■ Internal Index: Used to measure the goodness of a clustering
structure without respect to external information
■ SSE
■ SSE is good for comparing two clusterings or two clusters (average
SSE).
■ Can also be used to estimate the number of clusters
Internal Measures: Cohesion and Separation

■ Cluster Cohesion: Measures how closely related the objects in a
cluster are
■ Example: SSE
■ Cluster Separation: Measures how distinct or
well-separated a cluster is from other clusters
■ Example: Squared Error
■ Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = Σ_i Σ_{x ∈ Ci} (x − mi)²
■ Separation is measured by the between-cluster sum of squares:
BSS = Σ_i |Ci| (m − mi)²
where |Ci| is the size of cluster i, mi is the centroid of cluster i, and m
is the overall mean of the data
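A minimal sketch of both measures, assuming numeric data in a NumPy array and integer cluster labels (the names are illustrative):

```python
import numpy as np

def cohesion_separation(X, labels):
    """Within-cluster SSE (cohesion) and between-cluster SSE (separation)."""
    m = X.mean(axis=0)                       # overall mean of the data
    wss, bss = 0.0, 0.0
    for i in np.unique(labels):
        cluster = X[labels == i]
        mi = cluster.mean(axis=0)            # centroid of cluster i
        wss += np.sum((cluster - mi) ** 2)
        bss += len(cluster) * np.sum((mi - m) ** 2)
    return wss, bss
```

For any clustering of the same data, WSS + BSS equals the total sum of squares, so decreasing the within-cluster SSE necessarily increases the between-cluster SSE.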


Internal Measures: Silhouette Coefficient

■ The silhouette coefficient combines ideas of both cohesion and separation,
but for individual points, as well as for clusters and clusterings
■ For an individual point, i
■ Calculate a = average distance of i to the other points in its cluster
■ Calculate b = the minimum, over all other clusters, of the average
distance of i to the points in that cluster
■ The silhouette coefficient for a point is then given by

s = (b – a) / max(a,b)

■ The value ranges from −1 to 1, but is typically between 0 and 1.

■ The closer to 1, the better.

■ Can calculate the average silhouette coefficient for a cluster or a
clustering
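In practice the silhouette coefficient is usually computed with a library. A short sketch using scikit-learn follows; the toy data and the clustering step are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).normal(size=(100, 2))               # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_per_point = silhouette_samples(X, labels)   # s = (b - a) / max(a, b) per point
s_average = silhouette_score(X, labels)       # mean silhouette over all points
```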
External Measures: Fowlkes Mallows Index - FMI

■ Allows the clusters obtained to be evaluated against already known
classification labels (labels observed in the real world).
■ According to Dziopa [2016], the FMI score is defined as the geometric
mean of precision and recall.
■ It ranges from 0 to 1; values close to 1 indicate that the clustering is
close to the observed labels.
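A short example of computing the FMI with scikit-learn; the label vectors below are made-up toy values.

```python
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 0, 1, 1, 1]       # labels observed in the real world
cluster_labels = [0, 0, 1, 1, 1, 1]    # labels produced by a clustering

fmi = fowlkes_mallows_score(true_labels, cluster_labels)
# fmi lies in [0, 1]; values near 1 mean the clustering matches the known labels
```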
Final Comment on Cluster Validity

“The validation of clustering structures is the most difficult and
frustrating part of cluster analysis. Without a strong effort in this
direction, cluster analysis will remain a black art accessible only to
those true believers who have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes


Chapter 10. Cluster Analysis: Basic Concepts and
Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods

■ Hierarchical Methods

■ Density-Based Methods

■ Grid-Based Methods

■ Evaluation of Clustering

■ Summary

41
Summary
■ Cluster analysis groups objects based on their similarity and has
wide applications
■ Measure of similarity can be computed for various types of data
■ Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
■ K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
■ Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
■ DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
■ STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
■ Quality of clustering results can be evaluated in various ways
42
References (1)
■ R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
■ M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
■ M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
■ F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02
■ M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
■ M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
■ M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
■ D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
■ V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.

43
References (2)
■ D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
■ S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
■ S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
■ A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
■ A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
■ G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
■ L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
■ E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.

44
References (3)
■ G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
■ R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
■ L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
■ E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
■ G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
■ A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
■ A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
■ H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
■ W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining. VLDB'97
■ T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
■ X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06

45
