Anaum Hamid
https://sites.google.com/site/anaumhamid/data-mining/lectures
Outline
2. k-Means Clustering
3. Hierarchical Clustering
What is Cluster Analysis?
(Figure slides: the same set of people grouped by family, by age, and by gender; one data set can be clustered in different ways depending on the chosen criterion.)
Applications of Cluster Analysis
v Data reduction
§ Summarization: Preprocessing for regression, PCA,
classification, and association analysis
§ Compression: Image processing: vector quantization
v Hypothesis generation and testing
v Prediction based on groups
§ Cluster & find characteristics/patterns for each group
v Finding K-nearest Neighbors
§ Localizing search to one or a small number of clusters
v Outlier detection: Outliers are often viewed as those “far
away” from any cluster
Clustering: Application Examples
v Biology: taxonomy of living things: kingdom, phylum,
class, order, family, genus and species
v Information retrieval: document clustering
v Land use: Identification of areas of similar land use in
an earth observation database
v Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
v City-planning: Identifying groups of houses according
to their house type, value, and geographical location
v Earthquake studies: observed earthquake epicenters should be clustered along continent faults
v Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
v Economic science: market research
Earthquakes
(Figure: earthquake epicenters clustered along fault lines.)
Image Segmentation
(Figure: clustering used to segment an image into regions.)
Basic Steps to Develop a Clustering Task
v Feature selection
§ Select info concerning the task of interest
§ Minimal information redundancy
v Proximity measure
§ Similarity of two feature vectors
v Clustering criterion
§ Expressed via a cost function or some rules
v Clustering algorithms
§ Choice of algorithms
v Validation of the results
§ Validation test (also, clustering tendency test)
v Interpretation of the results
§ Integration with applications
Quality: What Is Good Clustering?
v Intra-cluster distances are minimized
v Inter-cluster distances are maximized
Measure the Quality of Clustering
v Dissimilarity/Similarity metric
§ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
§ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
§ Weights should be associated with different variables
based on applications and data semantics (a sketch of a
weighted distance follows this slide)
v Quality of clustering:
§ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
§ It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
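Not from the slides: a minimal sketch of how such per-variable weights might enter a distance function. The weight values here are purely illustrative; in practice they would come from the application and data semantics.

```python
# A weighted Euclidean distance: each variable contributes according to an
# application-specific weight (values below are made up for illustration).
import math

def weighted_euclidean(x, y, w):
    return math.sqrt(sum(wf * (a - b) ** 2 for wf, a, b in zip(w, x, y)))

# Down-weighting the second variable halves its influence on the distance.
print(weighted_euclidean((1, 7), (7, 1), (1.0, 0.5)))  # sqrt(36 + 18) ~ 7.35
```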
Data Structures
v Clustering algorithms typically operate on either
of the following two data structures:
v Data matrix (two modes): n objects by p variables
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
v Dissimilarity matrix (one mode): pairwise distances d(i, j)
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
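As a concrete illustration (not from the slides), SciPy can convert a data matrix into a dissimilarity matrix directly; the sample values below are made up.

```python
# Turning an n x p data matrix into an n x n dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 3.0],   # each row is an object, each column a variable
              [3.0, 3.0],
              [1.0, 2.0]])

# pdist returns the condensed lower triangle d(2,1), d(3,1), d(3,2), ...;
# squareform expands it to the full symmetric matrix with a zero diagonal.
D = squareform(pdist(X, metric="euclidean"))
print(D)
```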
Types of Data in Clustering Analysis
v Interval-scaled variables
v Binary variables
v Nominal, ordinal, and ratio variables
v Variables of mixed types
Interval-valued variables
v Standardize data
§ Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
§ Calculate the standardized measurement (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
v Manhattan (city-block) distance, the Minkowski distance with q = 1:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
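A minimal NumPy sketch of this standardization, on made-up sample data: it computes the per-variable mean m_f, the mean absolute deviation s_f, and the resulting z-scores.

```python
# Standardization using the mean absolute deviation instead of the
# standard deviation (sample values are illustrative only).
import numpy as np

X = np.array([[1.0, 3.0],
              [3.0, 3.0],
              [4.0, 3.0],
              [1.0, 2.0]])

m = X.mean(axis=0)              # m_f: per-variable mean
s = np.abs(X - m).mean(axis=0)  # s_f: mean absolute deviation
Z = (X - m) / s                 # standardized measurements z_if
print(Z)
```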
Euclidean Distance
v If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
§ Example (from the figure): for $x_i = (1, 7)$ and $x_j = (7, 1)$, $d(i,j) = \sqrt{36 + 36} = \sqrt{72} \approx 8.49$
§ Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality, for any point k)
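A minimal sketch (not from the slides) of the general Minkowski distance behind these two formulas: q = 1 gives the Manhattan distance of the previous slide, q = 2 the Euclidean distance, reproducing the figure's example.

```python
# Minkowski distance of order q between two points.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)          # the points from the figure above
print(minkowski(xi, xj, 1))      # Manhattan: 12
print(minkowski(xi, xj, 2))      # Euclidean: sqrt(72) ~ 8.49
```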
Considerations for Cluster Analysis
v Partitioning criteria
§ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
v Separation of clusters
§ Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than
one class)
v Similarity measure
§ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
v Clustering space
§ Full space (often when low dimensional) vs. subspaces (often
in high-dimensional clustering)
Requirements and Challenges
v Scalability
§ Clustering all the data instead of only samples
v Ability to deal with different types of attributes
§ Numerical, binary, categorical, ordinal, linked, and mixture of
these
v Constraint-based clustering
§ User may give inputs on constraints
§ Use domain knowledge to determine input parameters
v Interpretability and usability
v Others
§ Discovery of clusters with arbitrary shape
§ Ability to deal with noisy data
§ Incremental clustering and insensitivity to input order
§ High dimensionality
Major Clustering Approaches 1
v Partitioning approach:
§ Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
§ Typical methods: k-means, k-medoids, CLARANS
v Hierarchical approach:
§ Create a hierarchical decomposition of the set of data
(or objects) using some criterion
§ Typical methods: Diana, Agnes, BIRCH, CHAMELEON
v Density-based approach:
§ Based on connectivity and density functions
§ Typical methods: DBSCAN, OPTICS, DenClue
v Grid-based approach:
§ Based on a multiple-level granularity structure
§ Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches 2
v Model-based:
§ A model is hypothesized for each of the clusters, and the idea is
to find the best fit of the data to the given model
§ Typical methods: EM, SOM, COBWEB
v Frequent pattern-based:
§ Based on the analysis of frequent patterns
§ Typical methods: p-Cluster
v User-guided or constraint-based:
§ Clustering by considering user-specified or application-specific
constraints
§ Typical methods: COD (obstacles), constrained clustering
v Link-based clustering:
§ Objects are often linked together in various ways
§ Massive links can be used to cluster objects: SimRank, LinkClus
Partitioning Algorithms: Basic Concept
v Partitioning method: Partitioning a database D of n
objects into a set of k clusters, such that the sum of
squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left(d(p, c_i)\right)^2$$
v Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
§ Globally optimal: exhaustively enumerate all partitions
§ Heuristic methods: k-means and k-medoids algorithms
§ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
§ k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw'87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method
k-Means Algorithm Steps
1. Select the desired number of clusters, k
2. Initialize the k cluster centers (centroids), e.g., at random
3. Assign each object to the nearest cluster; the proximity of two objects is determined by their distance, and the distance used in the k-means algorithm is the Euclidean distance:
$$d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
4. Compute the sum of squared errors of the clustering:
$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$
5. Recompute the centroid (mean) of each cluster
6. Repeat steps 3–5 until the centroids no longer change
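A minimal sketch of these steps in plain Python, under the assumptions of Euclidean distance, two-dimensional points, and no cluster ever becoming empty. Run on the eight points of the exercise below with its initial centroids (1, 1) and (2, 1), it reproduces the iterations worked out there.

```python
# Plain k-means sketch (assumes 2-D points and no empty clusters).
def kmeans(points, centroids):
    while True:
        # Step 3: assign each point to the nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            d2 = [(p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        # Step 5: recompute each centroid as the mean of its cluster
        new = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
        if new == centroids:          # Step 6: stop when centroids are stable
            return centroids, clusters
        centroids = new

points = [(1,3), (3,3), (4,3), (5,3), (1,2), (4,2), (1,1), (2,1)]
print(kmeans(points, [(1.0, 1.0), (2.0, 1.0)]))
# converges to centroids (1.25, 1.75) and (4.0, 2.75), as in the exercise
```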
Exercise – Iteration 1
[STEP 3] With centroids m1 = (1, 1) and m2 = (2, 1), the cluster memberships are:
Cluster 1: (1, 3), (1, 2), (1, 1)
Cluster 2: (3, 3), (4, 3), (5, 3), (4, 2), (2, 1)
[STEP 4] SSE, using the squared Euclidean distance of each point to its centroid:
SSE = [(1-1)² + (3-1)²] + [(1-1)² + (2-1)²] + [(1-1)² + (1-1)²]
    + [(3-2)² + (3-1)²] + [(4-2)² + (3-1)²] + [(5-2)² + (3-1)²]
    + [(4-2)² + (2-1)²] + [(2-2)² + (1-1)²]
    = 4 + 1 + 0 + 5 + 8 + 13 + 5 + 0 = 36
MSE = 36/8 = 4.5
[STEP 5] The new centroids:
cluster1 = ((1 + 1 + 1)/3, (3 + 2 + 1)/3) = (3/3, 6/3) = (1, 2)
cluster2 = ((3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5) = (18/5, 12/5) = (3.6, 2.4)
Exercise – Iteration 2
[STEP 3] Assign each object to the closest cluster centroid. (The tabulated values are city-block/Manhattan distances to m1 = (1, 2) and m2 = (3.6, 2.4).)

Instance   x   y   d(., m1)   d(., m2)   Cluster
A          1   3   1          3.2        m1
B          3   3   3          1.2        m2
C          4   3   4          1.0        m2
D          5   3   5          2.0        m2
E          1   2   0          3.0        m1
F          4   2   3          0.8        m2
G          1   1   1          4.0        m1
H          2   1   2          3.0        m1
[STEP 5] The new centroids:
cluster1 = ((1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4) = (1.25, 1.75)
cluster2 = ((3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4) = (4, 2.75)

Exercise – Iteration 3
[STEP 3] Assign each object to the closest cluster centroid (city-block distances to m1 = (1.25, 1.75) and m2 = (4, 2.75)):

Instance   x   y   d(., m1)   d(., m2)   Cluster
A          1   3   1.5        3.25       m1
B          3   3   3.0        1.25       m2
C          4   3   4.0        0.25       m2
D          5   3   5.0        1.25       m2
E          1   2   0.5        3.75       m1
F          4   2   3.0        0.75       m2
G          1   1   1.0        4.75       m1
H          2   1   1.5        3.75       m1

Cluster memberships are unchanged from iteration 2, so the centroids stay the same and the algorithm terminates.
Comments on the K-Means Method
v Strength:
§ Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
§ Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
v Comment: Often terminates at a local optimum
v Weakness
§ Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
§ Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009, and the
elbow-method sketch below)
§ Sensitive to noisy data and outliers
§ Not suitable to discover clusters with non-convex shapes
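One common heuristic for choosing k, not covered on this slide, is the "elbow method": run k-means for several values of k and look for the point where the SSE curve flattens. A minimal sketch, assuming scikit-learn is available:

```python
# Elbow-method sketch: inertia_ is scikit-learn's name for the SSE.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([(1,3), (3,3), (4,3), (5,3), (1,2), (4,2), (1,1), (2,1)],
             dtype=float)

for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # SSE drops sharply, then flattens near a good k
```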
Variations of the K-Means Method
PAM: A Typical K-Medoids Algorithm
(Figure: PAM run with k = 2 on ten points, total cost = 20. Arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; compute the total cost of swapping a medoid with a non-medoid object and swap if quality is improved; repeat until no change.)
The K-Medoid Clustering Method
v K-Medoids Clustering: Find representative objects
(medoids) in clusters
v PAM (Partitioning Around Medoids, Kaufman &
Rousseeuw 1987)
§ Starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it
improves the total distance of the resulting clustering
§ PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
v Efficiency improvement on PAM
§ CLARA (Kaufman & Rousseeuw, 1990): PAM on
samples
§ CLARANS (Ng & Han, 1994): Randomized re-sampling
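A toy PAM-style sketch, not the published algorithm's exact procedure: it brute-forces medoid/non-medoid swaps under the Manhattan distance (an assumption here) and accepts a swap only when it reduces the total cost, stopping when no swap helps.

```python
# Brute-force k-medoids swap sketch (illustrative; real PAM is more refined).
from itertools import product

def cost(points, medoids):
    # total distance of every point to its nearest medoid
    return sum(min(abs(p[0]-m[0]) + abs(p[1]-m[1]) for m in medoids)
               for p in points)

def pam(points, k):
    medoids = points[:k]              # arbitrary initial medoids
    improved = True
    while improved:                   # until no swap improves the cost
        improved = False
        for m, o in product(medoids, points):
            if o in medoids:
                continue
            cand = [o if x == m else x for x in medoids]
            if cost(points, cand) < cost(points, medoids):
                medoids, improved = cand, True
    return medoids

points = [(1,3), (3,3), (4,3), (5,3), (1,2), (4,2), (1,1), (2,1)]
print(pam(points, 2))
```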
Hierarchical Clustering
Dendrogram: Shows How Clusters Are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram; a clustering of the data objects is obtained by cutting the dendrogram at the desired level.
(Figure: dendrogram over objects 3, 6, 4, 1, 2, 5, with merge heights between 0 and 0.2.)
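A minimal sketch of agglomerative clustering with SciPy (assumed available): build the merge tree, then cut it into a chosen number of clusters; dendrogram(Z) draws a tree like the one in the figure above.

```python
# Agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([(1,3), (3,3), (4,3), (5,3), (1,2), (4,2), (1,1), (2,1)],
             dtype=float)

Z = linkage(X, method="single")                  # single-link merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
# dendrogram(Z) renders the merge tree when matplotlib is available
```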
Distance between Clusters
v Density-connected
§ A point p is density-connected to a point q w.r.t. Eps and MinPts
if there is a point o such that both p and q are density-reachable
from o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
v Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
v Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border, and outlier points of a cluster, with Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm
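A minimal sketch of the standard DBSCAN procedure (assumptions: 2-D points, Euclidean Eps-neighborhoods): grow a cluster outward from each unvisited core point; noise points reached later become border points, and points that never join a cluster remain outliers.

```python
# DBSCAN sketch: labels are None (unvisited), -1 (noise/outlier), or a
# positive cluster id.
def region_query(points, i, eps):
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core point
            labels[i] = -1                 # noise (may later become border)
            continue
        cid += 1                           # start a new cluster from core i
        labels[i] = cid
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:            # noise becomes a border point
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = region_query(points, j, eps)
            if len(jn) >= min_pts:         # j is also core: expand through it
                seeds.extend(jn)
    return labels

pts = [(1,1), (1,2), (2,1), (2,2), (8,8), (8,9), (9,8), (25,25)]
print(dbscan(pts, eps=1.5, min_pts=3))     # [1, 1, 1, 1, 2, 2, 2, -1]
```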
DBSCAN: Sensitive to Parameters
Interactive demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
Density-Based Clustering Demo
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/
Grid-Based Clustering Method
Measuring Clustering Quality: Extrinsic Methods
v Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
v Q is good if it satisfies the following 4 essential criteria
§ Cluster homogeneity: the purer, the better (a purity sketch
follows this list)
§ Cluster completeness: objects belonging to the same category
in the ground truth should be assigned to the same cluster
§ Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
§ Small cluster preservation: splitting a small category into
pieces is more harmful than splitting a large category
into pieces
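The slides give no formula, but purity is one simple extrinsic measure aligned with the homogeneity criterion: each cluster is credited with its majority ground-truth category. A minimal sketch (note that purity alone ignores completeness and small-cluster preservation):

```python
# Purity: fraction of objects that match their cluster's majority category.
from collections import Counter

def purity(clusters, truth):
    # clusters/truth: parallel lists of cluster ids and ground-truth labels
    total = 0
    for c in set(clusters):
        labels = [t for k, t in zip(clusters, truth) if k == c]
        total += Counter(labels).most_common(1)[0][1]
    return total / len(clusters)

print(purity([1, 1, 1, 2, 2, 2], ["a", "a", "b", "b", "b", "a"]))  # 4/6
```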
Summary
v Cluster analysis groups objects based on their similarity and has
wide applications
v Measure of similarity can be computed for various types of data
v Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
v K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
v BIRCH and CHAMELEON are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
v DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
v STING and CLIQUE are grid-based methods, where CLIQUE is
also a subspace clustering algorithm
v Quality of clustering results can be evaluated in various ways
Summary: Clustering Methods and Their General Characteristics

Partitioning Methods
o Find mutually exclusive clusters of spherical shape
o Distance-based
o May use mean or medoid (etc.) to represent cluster center
o Effective for small- to medium-size data sets

Hierarchical Methods
o Clustering is a hierarchical decomposition (i.e., multiple levels)
o Cannot correct erroneous merges or splits
o May incorporate other techniques like microclustering or consider object “linkages”

Density-Based Methods
o Can find arbitrarily shaped clusters
o Clusters are dense regions of objects in space that are separated by low-density regions
o Cluster density: each point must have a minimum number of points within its “neighborhood”
o May filter out outliers

Grid-Based Methods
o Use a multiresolution grid data structure
o Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
Reference
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques Third Edition, Elsevier, 2012
2. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical
Machine Learning Tools and Techniques, 3rd Edition, Elsevier,
2011
3. Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining
Use Cases and Business Analytics Applications, CRC Press
Taylor & Francis Group, 2014
4. Daniel T. Larose, Discovering Knowledge in Data: An Introduction
to Data Mining, John Wiley & Sons, 2005
5. Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT
Press, 2014
6. Florin Gorunescu, Data Mining: Concepts, Models and
Techniques, Springer, 2011
7. Oded Maimon and Lior Rokach, Data Mining and Knowledge
Discovery Handbook Second Edition, Springer, 2010
8. Warren Liao and Evangelos Triantaphyllou (eds.), Recent
Advances in Data Mining of Enterprise Data: Algorithms and
Applications, World Scientific, 2007