jalali@mshdiua.ac.ir | Jalali.mshdiau.ac.ir
Data Mining
Clustering
Cluster analysis
Finding similarities between data objects according to the characteristics found in the data, and grouping similar data objects into clusters
Clustering: Applications
Pattern Recognition
Image Processing
Economic Science (especially market research)
WWW
Document classification
Clustering weblog data to discover groups of similar access patterns
The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
High dimensionality
Interpretability and usability
A contingency table for binary data (object i vs. object j):

                 Object j
               1        0      sum
Object i   1   a        b      a+b
           0   c        d      c+d
         sum   a+c      b+d    p

Distance measure for symmetric binary variables:
d(i, j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables:
d(i, j) = (b + c) / (a + b + c)
Attribute   Jack   Mary   Jim
Gender      M      F      M
Fever       Y      Y      Y
Cough       N      N      P
Test-1      P      P      N
Test-2      N      N      N
Test-3      N      P      N
Test-4      N      N      N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1 and the value N be set to 0:

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
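A minimal sketch in plain Python that reproduces the three distances above, using the slide's encoding (Y and P as 1, N as 0); the vectors cover only the six asymmetric attributes:

```python
def asymmetric_binary_distance(x, y):
    """d(i, j) = (b + c) / (a + b + c): negative matches (d) are ignored."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # positive matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 .. Test-4, with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```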
Nominal Variables
A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.

Method 1: Simple matching
m: # of matches, p: total # of variables

d(i, j) = (p - m) / p
Method 2: Use a large number of binary variables
Create a new binary variable for each of the M nominal states.
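A one-function sketch of simple matching; the two example records are invented for illustration:

```python
def nominal_distance(x, y):
    """Simple matching for nominal variables: d(i, j) = (p - m) / p."""
    p = len(x)                                       # total number of variables
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)   # number of matching states
    return (p - m) / p

print(nominal_distance(["red", "small", "round"],
                       ["red", "large", "round"]))   # 1/3 = 0.33...
```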
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, ROCK
Density-based approach:
Based on connectivity and density functions. Typical methods: DBSCAN, OPTICS
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model. Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns. Typical methods: pCluster
S1: a, c, b, d, c
S2: b, d, e, a
S3: e, c
[...]
Sn: b, e, d

Create an undirected graph based on the co-occurrence matrix M.
[Figure: co-occurrence matrix -> undirected graph -> clusters]

Resulting clusters:
C1: a, b, e
C2: c, f, g, i
C3: d

Small clusters are filtered out (MinClusterSize).
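A sketch of this pipeline, assuming edges are kept only for pairs that co-occur in at least MIN_SUPPORT sequences and clusters are the connected components of the resulting graph; only the four sequences listed above are used, and the MIN_SUPPORT and MIN_CLUSTER_SIZE values are illustrative:

```python
from collections import Counter
from itertools import combinations
import networkx as nx

sequences = [["a", "c", "b", "d", "c"], ["b", "d", "e", "a"],
             ["e", "c"], ["b", "e", "d"]]

# Co-occurrence matrix M: how often each pair of items shares a sequence
M = Counter()
for seq in sequences:
    for u, v in combinations(sorted(set(seq)), 2):
        M[(u, v)] += 1

# Undirected graph: keep pairs co-occurring at least MIN_SUPPORT times
MIN_SUPPORT = 2
G = nx.Graph((u, v) for (u, v), n in M.items() if n >= MIN_SUPPORT)

# Clusters = connected components; drop those below MIN_CLUSTER_SIZE
MIN_CLUSTER_SIZE = 2
clusters = [sorted(c) for c in nx.connected_components(G)
            if len(c) >= MIN_CLUSTER_SIZE]
print(clusters)  # [['a', 'b', 'd', 'e']] for these four sequences
```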
sim(Ti, Tj) = Σ_{k=1..N} (w_ik × w_jk)

N = total number of dimensions (in this case, documents)
w_ik = weight of term i in document k
Example:
      D1  D2  D3  D4  D5
T1     0   3   3   0   2
T2     4   1   0   1   2
T3     0   4   0   0   2
T4     0   3   0   3   3
T5     0   1   3   0   1
T6     2   2   0   0   4
T7     1   0   3   2   0
T8     3   1   0   0   2
sim(Ti, Tj) = Σ_{k=1..N} (w_ik × w_jk)
      T2  T3  T4  T5  T6  T7  T8
T1     7  16  15  14  14   9   7
T2         8  12   3  18   6  17
T3            18   6  16   0   8
T4                 6  18   6   9
T5                     6   9   3
T6                         2  16
T7                             3
Using a similarity threshold of 10, the matrix is converted to a binary similarity matrix (1 if sim >= 10, else 0):

      T2  T3  T4  T5  T6  T7  T8
T1     0   1   1   1   1   0   0
T2         0   1   0   1   0   1
T3             1   0   1   0   0
T4                 0   1   0   0
T5                     0   0   0
T6                         0   1
T7                             0
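Both matrices above follow mechanically from the term-document matrix; a NumPy sketch (the threshold of 10 is the one used above):

```python
import numpy as np

# Term-document matrix: rows T1..T8, columns D1..D5
W = np.array([[0, 3, 3, 0, 2],
              [4, 1, 0, 1, 2],
              [0, 4, 0, 0, 2],
              [0, 3, 0, 3, 3],
              [0, 1, 3, 0, 1],
              [2, 2, 0, 0, 4],
              [1, 0, 3, 2, 0],
              [3, 1, 0, 0, 2]])

S = W @ W.T                  # sim(Ti, Tj) = sum_k w_ik * w_jk
B = (S >= 10).astype(int)    # thresholded (binary) similarity matrix
np.fill_diagonal(B, 0)       # self-similarity is not of interest
print(S[0, 2], B[0, 2])      # sim(T1, T3) = 16 -> 1
```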
Graph Representation
The similarity matrix can be visualized as an undirected graph
each item is represented by a node, and edges connect pairs of items that are similar (a one in the thresholded similarity matrix)
[Figure: the threshold graph over T1..T8, with edges T1-T3, T1-T4, T1-T5, T1-T6, T2-T4, T2-T6, T2-T8, T3-T4, T3-T6, T4-T6, T6-T8; T7 is isolated]
Maximal cliques (and therefore the clusters) in the previous example are:

{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
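networkx can enumerate exactly these maximal cliques from the threshold graph; a small sketch:

```python
import networkx as nx

# Edges read off the binary similarity matrix above
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")  # T7 is similar to nothing at this threshold

# Clusters by the maximal-clique method (overlapping clusters allowed)
print(sorted(map(sorted, nx.find_cliques(G))))
# [['T1', 'T3', 'T4', 'T6'], ['T1', 'T5'], ['T2', 'T4', 'T6'],
#  ['T2', 'T6', 'T8'], ['T7']]
```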
In this case the single link method produces only two clusters:

{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, and therefore partitions the set of items.
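At a fixed threshold, single-link clusters are simply the connected components of the same graph; a sketch reusing the edge list from the clique example:

```python
import networkx as nx

edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")

# Single-link clusters at this threshold = connected components (no overlap)
print([sorted(c) for c in nx.connected_components(G)])
# [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]
```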
Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:

E = Σ_{m=1..k} Σ_{t_i ∈ C_m} (t_i - c_m)²

where c_m is the mean (centroid) of cluster C_m.
[Figure: K-means iterations on 2-D points (axes 0..10): assign each object to the nearest cluster mean, update the cluster means, then reassign; the assign/update steps repeat until membership no longer changes]
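A minimal NumPy sketch of that loop, with made-up 2-D points and random initial means:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: assign points to the nearest mean, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial means
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every point
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        # Update step: mean of the points assigned to each center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: no center moved
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5]])
print(kmeans(X, k=2))
```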
Document Clustering
Cluster Means

[Table: document-term matrix for documents D1..D8 and initial cluster means C1..C3]
Example: K-Means
Now compute the similarity (or distance) of each item with each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).
[Table: cluster-document similarity matrix, rows C1..C3, columns D1..D8]
For each document, reallocate the document to the cluster to which it has the highest similarity (the highest entry in its column of the similarity matrix above). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have been assigned, and that D1 and D6 have been reallocated from their original assignment.

[Table: the new clusters C1..C3 after reallocation]
This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.
Example: K-Means
Now compute the new cluster centroids using the original document-term matrix.

[Table: updated cluster means C1..C3 over documents D1..D8]
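The document-term values themselves did not survive into this text, so the sketch below uses a small hypothetical matrix; it performs one reallocation step exactly as described: dot-product similarity of every document with every centroid, reassignment to the most similar cluster, then recomputation of the centroids:

```python
import numpy as np

# Hypothetical document-term matrix (rows D1..D4) and two initial centroids
D = np.array([[1, 0, 2, 0],
              [0, 3, 0, 1],
              [2, 1, 1, 0],
              [0, 2, 0, 2]], dtype=float)
C = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)

sim = D @ C.T                # cluster-document similarity (dot products)
assign = sim.argmax(axis=1)  # reallocate each document to its best cluster
C = np.array([D[assign == j].mean(axis=0) for j in range(len(C))])
print(assign)                # [0 1 0 1]
print(C)                     # new centroids for the next iteration
```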
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
[Figure: k-medoids on 2-D points (axes 0..10); the cluster representatives are actual data objects]
1. Randomly select k objects as the initial medoids.
2. Associate each remaining object with its closest medoid.
3. For each medoid m and each non-medoid data point o, swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
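A PAM-style sketch of those steps (Manhattan distance, greedy swaps; the sample points are made up):

```python
import numpy as np

def pam(X, k, seed=0):
    """k-medoids: swap medoids with non-medoids while the total cost drops."""
    rng = np.random.default_rng(seed)
    dist = np.abs(X[:, None] - X[None, :]).sum(axis=-1)  # Manhattan distances

    def cost(medoids):
        # Total distance from every point to its closest medoid
        return dist[:, medoids].min(axis=1).sum()

    medoids = list(rng.choice(len(X), k, replace=False))
    improved = True
    while improved:                       # stop when no swap lowers the cost
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]
                if cost(trial) < cost(medoids):
                    medoids, improved = trial, True
    return medoids, dist[:, medoids].argmin(axis=1)

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
print(pam(X, k=2))  # chosen medoids and each point's cluster index
```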
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: agglomerative (AGNES) proceeds from step 0 to step 4, merging a and b into ab, d and e into de, then c with de into cde, and finally ab with cde into abcde; divisive (DIANA) runs the same steps in reverse, from step 4 back to step 0]
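Agglomerative merging of this kind is available off the shelf in SciPy; a sketch with invented points, using a distance cut of 2.0 as the termination condition:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 9.0]])

Z = linkage(X, method="single")  # AGNES-style bottom-up merging (single link)
labels = fcluster(Z, t=2.0, criterion="distance")  # cut the tree at distance 2
print(labels)  # e.g. [1 1 2 2 3]: three clusters at this cut
```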
Groups data objects in a top-down fashion. Initially all data objects are in one cluster; we then subdivide the cluster into smaller and smaller clusters until each object forms a cluster of its own or a termination condition holds.
[Figure: divisive clustering on 2-D points (axes 0..10), splitting one cluster into successively smaller ones]
Features of density-based clustering:
Handles noise
One scan
Needs density parameters as a termination condition
Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure)
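DBSCAN's two density parameters (eps and min_samples) can be tried directly with scikit-learn; a sketch on made-up data with two dense blobs and one outlier:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],   # dense blob 1
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.0],   # dense blob 2
              [9.0, 9.0]])                           # isolated point

db = DBSCAN(eps=0.5, min_samples=2).fit(X)  # the two density parameters
print(db.labels_)  # [0 0 0 1 1 1 -1]; label -1 marks noise/outliers
```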
Grid-based approach: quantize the object space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.

Assume that we have a set of records that we want to cluster with respect to two attributes; we divide the related space (a plane) into a grid structure and then find the clusters.
[Figure: a grid over Salary (x10,000, 0..8) versus Age (20..60), with records bucketed into cells]
The spatial area is divided into rectangular cells at several levels of resolution.
Statistical information about each cell is calculated and stored beforehand and is used to answer queries.
Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells:
count, mean, s, min, max
type of distribution: normal, uniform, etc.
Comments on STING
Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Repeat this process until the bottom layer is reached.
Advantages:
Query-independent, easy to parallelize, incremental update
Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
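A toy sketch of the grid idea described above: bucket (age, salary) records into fixed cells and precompute per-cell statistics that queries can then reuse; the data and cell widths are invented:

```python
import numpy as np

# Made-up (age, salary) records; salary in units of 10,000
pts = np.array([[25, 3.0], [27, 3.5], [26, 2.8],
                [45, 6.0], [47, 6.5], [58, 2.0]])

# Assign each record to a grid cell (age cells 10 wide, salary cells 2 wide)
cells = {}
for age, sal in pts:
    key = (int((age - 20) // 10), int(sal // 2))
    cells.setdefault(key, []).append(sal)

# Per-cell statistics computed once, STING-style, for later query answering
for key, vals in sorted(cells.items()):
    print(key, "count =", len(vals), "mean =", round(float(np.mean(vals)), 2))
```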
Model-Based Clustering
What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical model.
Based on the assumption that data are generated by a mixture of underlying probability distributions.
Typical methods:
Statistical approach: EM, COBWEB, CLASSIT
Neural network approach: SOM
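EM over a mixture of Gaussians can be tried with scikit-learn; a sketch on synthetic two-blob data, where each fitted component plays the role of one cluster:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # blob around (0, 0)
               rng.normal(6.0, 1.0, (50, 2))])  # blob around (6, 6)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM fitting
print(gm.means_)          # fitted component means, one per cluster
print(gm.predict(X[:3]))  # hard cluster assignments for the first points
```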
Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning.
Produces a classification scheme for a set of unlabeled objects.
Finds a characteristic description for each concept (class).
COBWEB
A popular and simple method of incremental conceptual learning. Creates a hierarchical clustering in the form of a classification tree. Each node refers to a concept and contains a probabilistic description of that concept.
Customer segmentation
Medical analysis
Assignment: using Weka on a UCI dataset, apply and compare:
1. KMeans
2. DBSCAN
3. OPTICS
4. EM