DM 10, 11: Clustering
https://sites.google.com/site/mtsiddiquecs/dm
1. What is Cluster Analysis?
2. k-Means Clustering
3. Hierarchical Clustering
4. Density-Based Clustering
5. Grid-Based Clustering
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What is Cluster Analysis?
[Figures: the same set of objects grouped in three different ways: by family, by age, and by gender]
Applications of Cluster Analysis
Data reduction
Summarization: Preprocessing for regression, PCA,
classification, and association analysis
Compression: Image processing: vector quantization
Hypothesis generation and testing
Prediction based on groups
Cluster & find characteristics/patterns for each group
Finding K-nearest neighbors
Localizing search to one or a small number of clusters
Outlier detection: Outliers are often viewed as those “far away” from any cluster
Clustering: Application Examples
Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in
an earth observation database
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
City-planning: Identifying groups of houses according
to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Climate: understanding earth climate, finding patterns of atmospheric and ocean processes
Economic Science: market research
[Figure: clustering of earthquake epicenters]
[Figure: image segmentation via clustering]
Basic Steps to Develop a Clustering Task
Feature selection
Select info concerning the task of interest
Minimal information redundancy
Proximity measure
Similarity of two feature vectors
Clustering criterion
Expressed via a cost function or some rules
Clustering algorithms
Choice of algorithms
Validation of the results
Validation test (also, clustering tendency test)
Interpretation of the results
Integration with applications
Quality: What Is Good Clustering?
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
Data Structures
Clustering algorithms typically operate on either of the following two data structures:

Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
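To make the relationship between the two structures concrete, here is a minimal Python sketch (the three 2-D objects are made up for illustration) that derives the one-mode dissimilarity matrix from a two-mode data matrix using Euclidean distance:

```python
import math

# Data matrix: n = 3 objects (rows) x p = 2 variables (columns); toy values.
data = [[1.0, 3.0],
        [3.0, 3.0],
        [4.0, 3.0]]

def d(i, j):
    """Euclidean dissimilarity between objects i and j."""
    return math.dist(data[i], data[j])

# Dissimilarity matrix: lower triangle with zeros on the diagonal.
for i in range(len(data)):
    print([round(d(i, j), 2) for j in range(i + 1)])
# Output:
# [0.0]
# [2.0, 0.0]
# [3.0, 1.0, 0.0]
```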
Data Matrix
The data matrix represents n objects (rows) measured on p variables (columns), as shown above.
Type of Data in Clustering Analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
Calculate the standardized measurement (z-score): $z_{if} = \dfrac{x_{if} - m_f}{s_f}$
With Minkowski parameter q = 1, d is the Manhattan (city-block) distance:
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
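A minimal sketch of this standardization, using made-up measurements for one variable f; the z-score step is the usual follow-up to computing s_f and is an assumption here:

```python
# Toy measurements x_1f ... x_nf for a single variable f.
values = [12.0, 15.0, 11.0, 18.0, 14.0]
n = len(values)

m_f = sum(values) / n                              # mean of variable f
s_f = sum(abs(x - m_f) for x in values) / n        # mean absolute deviation
z = [(x - m_f) / s_f for x in values]              # standardized measurements

print(m_f, s_f)   # 14.0 2.0
print(z)          # [-1.0, 0.5, -1.5, 2.0, 0.0]
```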
Euclidean Distance
If q = 2, d is the Euclidean distance:
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Example: for $x_i = (1, 7)$ and $x_j = (7, 1)$, $d(i,j) = \sqrt{6^2 + 6^2} = \sqrt{72} \approx 8.49$.
Properties:
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j) (triangle inequality)
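A quick check of the worked example in Python:

```python
import math

def euclidean(x, y):
    """d(i, j) = square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean((1, 7), (7, 1)))   # sqrt(72) = 8.485...
```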
Nominal variables: simple matching gives $d(i,j) = \frac{p - m}{p}$, where m is the number of variables on which objects i and j match and p is the total number of variables.
Ratio-scaled variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
treat them like interval-scaled variables (not a good choice! why? the scale can be distorted)
apply logarithmic transformation: $y_{if} = \log(x_{if})$
treat them as continuous ordinal data; treat their rank as interval-scaled
Variables of Mixed Types
A database may contain several types of variables at once. A weighted formula can combine their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
Ordinal variables are first mapped to ranks and scaled to $[0, 1]$ by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, where $M_f$ is the number of ordered states of variable f.
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than
one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often
in high-dimensional clustering)
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
Major Clustering Approaches 1
Partitioning approach:
Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data
(or objects) using some criterion
Typical methods: Diana, Agnes, BIRCH, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DENCLUE
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches 2
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to each model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
Partitioning Algorithms: Basic Concept
Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$):
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left( d(p, c_i) \right)^2$$
Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Iterate: assign each object to its nearest centroid, then recompute the centroids; repeat until no change (the steps are detailed below).
k-Means Algorithm Steps
1. Select the desired number of clusters k.
2. Initialize the k cluster centers (centroids) at random.
3. Assign each data object to the nearest cluster. The proximity of two objects is determined by their distance; the distance used in the k-means algorithm is the Euclidean distance:
$$d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
4. Compute the clustering error as the sum of squared errors over all clusters:
$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$
5. Recompute each centroid $m_i$ as the mean of the objects assigned to cluster $C_i$; repeat from step 3 until the cluster memberships no longer change.
Exercise – Iteration 1
Data points: (1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1), with initial centroids $m_1 = (1, 1)$ and $m_2 = (2, 1)$.
Cluster 1: (1, 3), (1, 2), (1, 1); Cluster 2: (3, 3), (4, 3), (5, 3), (4, 2), (2, 1)
SSE = [(1−1)² + (3−1)²] + [(1−1)² + (2−1)²] + [(1−1)² + (1−1)²] + [(3−2)² + (3−1)²] + [(4−2)² + (3−1)²] + [(5−2)² + (3−1)²] + [(4−2)² + (2−1)²] + [(2−2)² + (1−1)²]
SSE = 4 + 1 + 0 + 5 + 8 + 13 + 5 + 0 = 36
MSE = 36/8 = 4.5
[STEP 5] The new centroids:
cluster 1 = ((1 + 1 + 1)/3, (3 + 2 + 1)/3) = (1, 2)
cluster 2 = ((3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5) = (3.6, 2.4)
Exercise – Iteration 2
[STEP 3] Assign each object to the closest cluster centroid (the tabulated values are city-block distances, |Δx| + |Δy|).

Instance   x  y   d to m1 (1, 2)   d to m2 (3.6, 2.4)   Membership
A          1  3   1                3.2                  m1
B          3  3   3                1.2                  m2
C          4  3   4                1.0                  m2
D          5  3   5                2.0                  m2
E          1  2   0                3.0                  m1
F          4  2   3                0.8                  m2
G          1  1   1                4.0                  m1
H          2  1   2                3.0                  m1

[STEP 5] The new centroids:
cluster 1 = ((1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4) = (1.25, 1.75)
cluster 2 = ((3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4) = (4, 2.75)
Exercise – Iteration 3
[STEP 3] Assign each object to the closest cluster centroid.

Instance   x  y   d to m1 (1.25, 1.75)   d to m2 (4, 2.75)   Membership
A          1  3   1.5                    3.25                m1
B          3  3   3.0                    1.25                m2
C          4  3   4.0                    0.25                m2
D          5  3   5.0                    1.25                m2
E          1  2   0.5                    3.75                m1
F          4  2   3.0                    0.75                m2
G          1  1   1.0                    4.75                m1
H          2  1   1.5                    3.75                m1

The memberships are unchanged from iteration 2, so the centroids do not move and the algorithm terminates.
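As a sanity check, here is a minimal Python sketch of the algorithm that reproduces the exercise. Two assumptions are made explicit: distances are city-block (as in the tables above), and the initial centroids are m1 = (1, 1) and m2 = (2, 1), which the iteration-1 SSE terms imply; empty clusters are not handled.

```python
# Minimal k-means sketch reproducing the 8-point exercise above.

points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
centroids = [(1.0, 1.0), (2.0, 1.0)]   # assumed initial centroids

def dist(p, c):
    """City-block distance |dx| + |dy|, matching the tables above."""
    return abs(p[0] - c[0]) + abs(p[1] - c[1])

while True:
    # STEP 3: assign each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # STEP 5: recompute each centroid as the mean of its cluster
    # (this sketch assumes no cluster ever becomes empty).
    new_centroids = [(sum(x for x, _ in c) / len(c),
                      sum(y for _, y in c) / len(c)) for c in clusters]
    if new_centroids == centroids:     # memberships stabilized: stop
        break
    centroids = new_centroids

print(centroids)   # [(1.25, 1.75), (4.0, 2.75)]
```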
Comments on the K-Means Method
Strength:
Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
Comment: Often terminates at a local optimum
Weakness:
Applicable only to objects in a continuous n-dimensional space
• Use the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
Sensitive to noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
[Figure: two example scatter plots]
PAM: A Typical K-Medoids Algorithm
[Figure: example with Total Cost = 20]
1. Arbitrarily choose k objects as initial medoids.
2. Assign each remaining object to the nearest medoid.
3. Repeat: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid O with O_random, and perform the swap if the quality is improved.
4. Until no change.
The K-Medoids Clustering Method
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
Efficiency improvements on PAM:
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
CLARANS (Ng & Han, 1994): Randomized re-sampling
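A minimal sketch of the swap loop described above, with a greedy accept-any-improving-swap rule and a toy 1-D data set (both are illustrative assumptions, not the exact PAM procedure):

```python
import random

def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist, seed=0):
    medoids = random.Random(seed).sample(points, k)   # arbitrary initial medoids
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:                                   # until no change
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                trial = [o if x == m else x for x in medoids]  # swap m <-> o
                cost = total_cost(points, trial, dist)
                if cost < best:                       # keep improving swaps only
                    medoids, best = trial, cost
                    improved = True
    return medoids, best

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(pam(data, k=2, dist=lambda a, b: abs(a - b)))   # medoids 2.0 and 11.0
```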
Hierarchical Clustering
Use distance matrix as clustering criteria
This method does not require the number of clusters k
as an input, but needs a termination condition
Agglomerative (AGNES), steps 0 to 4: a and b merge into ab; d and e merge into de; c and de merge into cde; finally ab and cde merge into abcde.
Divisive (DIANA) performs the same sequence in reverse, from step 4 down to step 0.
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Uses the single-link method and the dissimilarity matrix
Merges nodes that have the least dissimilarity
Goes on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing successive AGNES merges]
Dendrogram: Shows How Clusters Are Merged
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
[Figure: dendrogram over objects 3, 6, 4, 1, 2, 5, with merge heights between 0 and 0.25]
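As a hedged sketch of the merge-and-cut procedure with SciPy (single link, as in AGNES above; the six 1-D points are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1.0], [1.2], [3.0], [3.1], [7.0], [7.3]])

Z = linkage(X, method='single')    # agglomerative, least-dissimilarity merges

# Cut the dendrogram at height 1.0: each connected component below the cut
# becomes one cluster.
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)                       # e.g., [1 1 2 2 3 3]

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree
# (requires matplotlib).
```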
DIANA (Divisive Analysis)
The inverse of AGNES: start with all objects in one cluster and split until each object eventually forms its own cluster.
[Figure: three scatter plots showing successive divisive splits]
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
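Minimal sketches of the first three measures, for clusters given as lists of points and an arbitrary point-level distance function (the centroid and medoid variants reduce to a single point-to-point distance once Ci, Cj or Mi, Mj are known):

```python
from itertools import product

def single_link(Ki, Kj, dist):
    """Smallest distance between an element of Ki and an element of Kj."""
    return min(dist(p, q) for p, q in product(Ki, Kj))

def complete_link(Ki, Kj, dist):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(dist(p, q) for p, q in product(Ki, Kj))

def average_link(Ki, Kj, dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(Ki, Kj)) / (len(Ki) * len(Kj))
```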
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
Centroid: the “middle” of a cluster:
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$
Radius: square root of the average distance from any point of the cluster to its centroid:
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
Diameter: square root of the average mean squared distance between all pairs of points in the cluster:
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
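A direct transcription of the three formulas in Python, using the cluster {(1, 3), (1, 2), (1, 1)} from the earlier k-means exercise:

```python
import math

points = [(1.0, 3.0), (1.0, 2.0), (1.0, 1.0)]
N = len(points)

# Centroid: component-wise mean of the cluster's points.
Cm = tuple(sum(c) / N for c in zip(*points))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Radius: sqrt of the average squared distance from the points to the centroid.
Rm = math.sqrt(sum(sqdist(p, Cm) for p in points) / N)

# Diameter: sqrt of the average squared distance over all ordered pairs
# (the i = j terms are zero; the denominator is N(N - 1), as in the formula).
Dm = math.sqrt(sum(sqdist(p, q) for p in points for q in points) / (N * (N - 1)))

print(Cm, Rm, Dm)   # (1.0, 2.0) 0.816... 1.414...
```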
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where n is the number of total objects
Integration of hierarchical & distance-based clustering:
BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and sensitive to the order of the data records
Clustering Feature Vector in BIRCH
Clustering Feature: CF = (N, LS, SS), where N is the number of data points in the subcluster, $LS = \sum_{i=1}^{N} X_i$ is the linear sum of the points, and SS is their square sum.
[Figure: scatter plot in which the points (2, 6), (4, 5), (4, 7), (3, 8) are summarized by a single CF]
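A small sketch of computing and merging CFs for the points shown above; CF additivity is what lets BIRCH absorb new points and merge subclusters incrementally (representing SS as a scalar is one common convention, assumed here):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a set of d-dimensional points."""
    n = len(points)
    ls = tuple(sum(c) for c in zip(*points))      # linear sum, per dimension
    ss = sum(x * x for p in points for x in p)    # square sum (scalar form)
    return (n, ls, ss)

def cf_merge(a, b):
    """CF additivity: the CF of a union is the sum of the CFs."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            a[2] + b[2])

pts = [(2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                                         # (4, (13, 26), 219)
print(cf_merge(cf(pts[:2]), cf(pts[2:])) == cf(pts))   # True
```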
CF-Tree in BIRCH
Clustering feature:
Summary of the statistics for a given subcluster: the
0-th, 1st, and 2nd moments of the subcluster from the
statistical point of view
Registers crucial measurements for computing cluster
and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their
children
A CF tree has two parameters
Branching factor: max # of children
Threshold: max diameter of sub-clusters stored at the leaf nodes
The CF Tree Structure
[Figure: CF tree with branching factor B = 7 and leaf capacity L = 6; the root holds entries CF1, CF2, CF3, …, CF6, each with a child pointer; a non-leaf node holds entries CF1, CF2, CF3, …, CF5]
The BIRCH Algorithm
Cluster diameter:
$$D = \sqrt{\frac{1}{n(n-1)} \sum_{i} \sum_{j} \left( x_i - x_j \right)^2}$$

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
Framework: Data Set → construct a k-NN graph → partition the graph → merge partitions → Final Clusters
k-NN graph: p and q are connected if q is among the top k closest neighbors of p
Relative interconnectivity: connectivity of c1 and c2 over internal connectivity
Relative closeness: closeness of c1 and c2 over internal closeness

CHAMELEON (Clustering Complex Objects)
[Figure: CHAMELEON results on complex-shaped data sets]
Density-Based Clustering Methods
Density-connected: A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
[Figure: p and q density-connected through o]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
Discovers clusters of arbitrary shape in spatial
databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p, and retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. Continue the process until all points have been processed.
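A minimal brute-force sketch of the algorithm, following the notions above; Euclidean ε-neighborhoods include the point itself, the toy data set is made up, and a real implementation would use a spatial index instead of linear scans:

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or NOISE (-1)."""
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]  # includes p

    labels = {}
    cluster = 0
    for p in points:
        if p in labels:
            continue                       # already clustered or noise
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = NOISE              # may be relabeled as border later
            continue
        cluster += 1                       # p is a core point: start a cluster
        labels[p] = cluster
        seeds = list(nbrs)
        while seeds:                       # expand the cluster
            q = seeds.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster        # border point: absorb, don't expand
            if q in labels:
                continue
            labels[q] = cluster
            if len(neighbors(q)) >= min_pts:
                seeds.extend(neighbors(q)) # q is also core: keep expanding
    return labels

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (25, 25)]
print(dbscan(data, eps=2.0, min_pts=3))
# Three points near (1, 1) form cluster 1, three near (8, 8) form
# cluster 2, and (25, 25) is labeled noise (-1).
```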
DBSCAN: Sensitive to Parameters
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
Produces a special order of the database wrt its
density-based clustering structure
This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad
range of parameter settings
Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
Can be represented graphically or using visualization techniques
OPTICS: Some Extensions from DBSCAN
Index-based: k = # of dimensions, N = # of points
Complexity: O(N log N)
Core distance of an object p: the smallest value ε′ such that the ε′-neighborhood of p has at least MinPts objects
Let $N_\varepsilon(p)$ be the ε-neighborhood of p, where ε is a distance value:
$$\text{core-distance}_{\varepsilon,\text{MinPts}}(p) = \begin{cases} \text{Undefined}, & \text{if } card(N_\varepsilon(p)) < \text{MinPts} \\ \text{MinPts-distance}(p), & \text{otherwise} \end{cases}$$
Reachability distance of object p from core object q: the minimum radius value that makes p density-reachable from q:
$$\text{reachability-distance}_{\varepsilon,\text{MinPts}}(p, q) = \begin{cases} \text{Undefined}, & \text{if } q \text{ is not a core object} \\ \max(\text{core-distance}(q),\ \text{distance}(q, p)), & \text{otherwise} \end{cases}$$
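A small brute-force sketch of the two definitions (here the ε-neighborhood counts the point itself, a common convention assumed for illustration; a real OPTICS implementation maintains these values while ordering the points):

```python
import math

def core_distance(p, points, eps, min_pts):
    """Smallest radius that makes p a core point, or None (undefined)."""
    dists = sorted(math.dist(p, q) for q in points)   # includes d(p, p) = 0
    if sum(1 for d in dists if d <= eps) < min_pts:
        return None                                   # card(N_eps(p)) < MinPts
    return dists[min_pts - 1]                         # MinPts-distance(p)

def reachability_distance(p, q, points, eps, min_pts):
    """max(core-distance(q), dist(q, p)), or None if q is not a core object."""
    cd = core_distance(q, points, eps, min_pts)
    return None if cd is None else max(cd, math.dist(q, p))
```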
Core Distance & Reachability Distance
[Figure: illustration of core distance and reachability distance for sample points]
[Figure: reachability plot; the reachability distance (undefined for some objects) plotted against the cluster order of the objects, with valleys marking clusters at different values of ε]
Density-Based Clustering: OPTICS & Applications
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo
Grid-Based Clustering Method
The STING Clustering Method
Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from parameters of lower level cells
count, mean, s, min, max
type of distribution (normal, uniform, etc.)
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer, typically with a small number of cells
For each cell in the current level compute the confidence
interval
STING Algorithm and Its Analysis
Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the next lower level
Repeat this process until the bottom layer is reached
Advantages:
Query-independent, easy to parallelize, incremental update
O(K), where K is the number of grid cells at the lowest
level
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
Assessing Clustering Tendency
Assess if non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
Test spatial randomness by a statistical test: the Hopkins Statistic
Given a dataset D regarded as a sample of a random variable o, determine how far away o is from being uniformly distributed in the data space
Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi = min{dist(pi, v)}, where v ∈ D
Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{dist(qi, v)}, where v ∈ D and v ≠ qi
Calculate the Hopkins Statistic:
$$H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$
If D is uniformly distributed, H is close to 0.5; if D is highly clustered, H is close to 0.
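A sketch of the statistic; since the slide's definition is terse, two reading choices are made explicit here: the comparison locations are drawn uniformly from the bounding box of the data space, and a data point's nearest neighbor excludes the point itself.

```python
import math, random

def hopkins(D, n, seed=0):
    """Hopkins statistic of dataset D (a list of coordinate tuples)."""
    rng = random.Random(seed)
    # p_i: n locations drawn uniformly from the bounding box of the data
    # space; x_i is the distance from p_i to the nearest point of D.
    lo = [min(dim) for dim in zip(*D)]
    hi = [max(dim) for dim in zip(*D)]
    P = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in range(n)]
    x = [min(math.dist(p, v) for v in D) for p in P]
    # q_i: n actual data points sampled from D; y_i is the distance from
    # q_i to its nearest neighbor in D - {q_i}.
    Q = rng.sample(D, n)
    y = [min(math.dist(q, v) for v in D if v != q) for q in Q]
    # Uniform data: H ~ 0.5. Clustered data: y_i << x_i, so H is near 0.
    return sum(y) / (sum(x) + sum(y))
```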
Determine the Number of Clusters
Empirical method:
# of clusters ≈ √(n/2) for a dataset of n points
Elbow method:
Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
Cross-validation method:
Divide a given data set into m parts
Use m − 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
• E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find the # of clusters that fits the data best
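A sketch of the elbow method on synthetic data with three well-separated groups; scikit-learn's KMeans is an assumed convenience here, and any k-means implementation would do:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated 2-D groups of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # within-cluster sum of squared distances
# The inertia drops sharply up to k = 3 (the true number of groups here)
# and then flattens out: that turning point is the "elbow".
```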
Measuring Clustering Quality
Measuring Clustering Quality: Extrinsic Methods
Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
Q is good if it satisfies the following 4 essential criteria
Cluster homogeneity: the purer, the better
Cluster completeness: should assign objects belong to
the same category in the ground truth to the same
cluster
Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
K-means and K-medoids algorithms are popular partitioning-based clustering algorithms
BIRCH and CHAMELEON are interesting hierarchical clustering algorithms, and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
Summary
Partitioning methods: find mutually exclusive clusters of spherical shape; distance-based; may use mean or medoid (etc.) to represent cluster center; effective for small to medium size data sets
Hierarchical methods: clustering is a hierarchical decomposition (i.e., multiple levels); cannot correct erroneous merges or splits; may incorporate other techniques like microclustering or consider object “linkages”
Density-based methods: can find arbitrarily shaped clusters; clusters are dense regions of objects in space that are separated by low-density regions; cluster density: each point must have a minimum number of points within its “neighborhood”; may filter out outliers
Grid-based methods: use a multiresolution grid data structure; fast processing time (typically independent of the number of data objects, yet dependent on grid size)
References
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Third Edition, Elsevier, 2012.
2. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011.
3. Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press / Taylor & Francis Group, 2014.
4. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005.
5. Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT Press, 2014.
6. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011.
7. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer, 2010.
8. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007.