

jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Data Mining

Clustering


What is Cluster Analysis?


Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters

Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Clustering: Applications
Pattern Recognition

Image Processing
Economic Science (especially market research)

WWW
Document classification
Clustering Weblog data to discover groups of similar access patterns

Examples of Clustering Applications


Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs


Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location

Quality: What Is Good Clustering?


A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Measure the Quality of Clustering


Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric: d(i, j)


There is a separate quality function that measures the goodness of a cluster.
The definitions of distance functions are usually very different for different kinds of variables.

It is hard to define "similar enough" or "good enough"


the answer is typically highly subjective.

Requirements of Clustering in Machine Learning


Scalability

Ability to deal with different types of attributes


Ability to handle dynamic data
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
High dimensionality
Interpretability and usability

Similarity and Dissimilarity Between Objects


There are several methods to measure similarity (distance) between objects (see the KNN slides).

Dissimilarity between Binary Variables


A contingency table for binary data

                  Object j
                  1        0        sum
Object i    1     a        b        a + b
            0     c        d        c + d
           sum   a + c    b + d      p

Distance measure for symmetric binary variables:
d(i, j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables:
d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables


Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     M        Y       N       P        N        N        N
Mary     F        Y       N       P        N        P        N
Jim      M        Y       P       N        N        N        N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary (Y and P are coded as 1, N as 0).

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
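The calculation above can be checked with a short Python sketch (not part of the original slides). Only the asymmetric attributes (Fever, Cough, Test-1 to Test-4) are used, with Y/P coded as 1 and N as 0; the helper function name is ours.

```python
# Sketch: asymmetric binary dissimilarity, d(i, j) = (b + c) / (a + b + c)

def asym_binary_dissim(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # 1-1 matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # i has it, j does not
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # j has it, i does not
    return (b + c) / (a + b + c)        # 0-0 matches are ignored (asymmetric case)

# Asymmetric attributes only (gender is symmetric and is left out):
# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P = 1 and N = 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```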

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = (p - m) / p

Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
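A minimal sketch of Method 1 (simple matching); the example attribute values are made up for illustration.

```python
# Sketch: simple-matching dissimilarity for nominal variables, d(i, j) = (p - m) / p

def nominal_dissim(x, y):
    p = len(x)                                       # total number of variables
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)   # number of matching states
    return (p - m) / p

# Two objects described by three nominal attributes (colour, size, shape)
print(nominal_dissim(["red", "small", "round"],
                     ["red", "large", "square"]))    # (3 - 1) / 3 = 0.67 (approx.)
```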

Major Clustering Approaches (I)


Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Typical methods: graph partitioning, k-means, k-medoids

Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK

Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS

Major Clustering Approaches (II)


Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE

Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
Typical methods: EM, SOM, COBWEB

Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster

An Example: Graph Partitioning(1)


Sessions:
S1: a, c, b, d, c
S2: b, d, e, a
S3: e, c
[...]
Sn: b, e, d

Build the sessions co-occurrence matrix M, then create an undirected graph based on matrix M.

An Example: Graph Partitioning(2)

(Figure: the co-occurrence matrix, the corresponding undirected graph, and the resulting clusters)

Clusters:
C1: a, b, e
C2: c, f, g, i
C3: d

Small clusters are filtered out (MinClusterSize)

An Example: Graph Partitioning(3)


Distance (Similarity) Matrix for Graph-based Clustering


Similarity (Distance) Matrix
Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values.
The (i, j) entry in the matrix is the distance (similarity) between items i and j: dij = similarity (or distance) of Di to Dj
Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1s (similarity) or all 0s (distance).

Example: Term Similarities in Documents


Suppose we want to cluster terms that appear in a collection of documents with different frequencies
Each term can be viewed as a vector of term frequencies (weights)
      Doc1   Doc2   Doc3   Doc4   Doc5
T1      0      3      3      0      2
T2      4      1      0      1      2
T3      0      4      0      0      2
T4      0      3      0      3      3
T5      0      1      3      0      1
T6      2      2      0      0      4
T7      1      0      3      2      0
T8      3      1      0      0      2

We need to compute a term-term similarity matrix.

For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):

sim(Ti, Tj) = sum over k = 1..N of (wik x wjk)

where N = total number of dimensions (in this case, documents) and wik = weight of term i in document k.

Example:
Sim(T1, T2) = <0,3,3,0,2> . <4,1,0,1,2> = 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
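The full term-term similarity matrix can be reproduced with a single matrix product. The sketch below assumes NumPy is available and uses the term weights from the table above; the last line also shows the thresholding used on the later slides.

```python
import numpy as np

# Rows = terms T1..T8, columns = Doc1..Doc5 (weights from the table above)
W = np.array([[0, 3, 3, 0, 2],    # T1
              [4, 1, 0, 1, 2],    # T2
              [0, 4, 0, 0, 2],    # T3
              [0, 3, 0, 3, 3],    # T4
              [0, 1, 3, 0, 1],    # T5
              [2, 2, 0, 0, 4],    # T6
              [1, 0, 3, 2, 0],    # T7
              [3, 1, 0, 0, 2]])   # T8

S = W @ W.T                       # S[i, j] = sum_k w_ik * w_jk (dot product)
print(S[0, 1])                    # Sim(T1, T2) = 7
print((S >= 10).astype(int))      # thresholded version used on the next slides
```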

Example: Term Similarities in Documents


Using the same term-document matrix as on the previous slide and sim(Ti, Tj) = sum over k = 1..N of (wik x wjk), we obtain the term-term similarity matrix.

Term-Term Similarity Matrix (upper triangle shown; the matrix is symmetric):

      T2    T3    T4    T5    T6    T7    T8
T1     7    16    15    14    14     9     7
T2           8    12     3    18     6    17
T3                18     6    16     0     8
T4                       6    18     6     9
T5                             6     9     3
T6                                   2    16
T7                                         3

Similarity (Distance) Thresholds


A similarity (distance) threshold may be used to mark pairs that are sufficiently similar.

Using a threshold value of 10 on the term-term similarity matrix of the previous example:

      T2    T3    T4    T5    T6    T7    T8
T1     7    16    15    14    14     9     7
T2           8    12     3    18     6    17
T3                18     6    16     0     8
T4                       6    18     6     9
T5                             6     9     3
T6                                   2    16
T7                                         3

gives the binary threshold matrix (1 = similarity >= 10):

      T2    T3    T4    T5    T6    T7    T8
T1     0     1     1     1     1     0     0
T2           0     1     0     1     0     1
T3                 1     0     1     0     0
T4                       0     1     0     0
T5                             0     0     0
T6                                   0     1
T7                                         0

Graph Representation
The similarity matrix can be visualized as an undirected graph
each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)
(Figure: the graph induced by the threshold matrix above; nodes T1 through T8, with an edge between each pair whose entry is 1. T7 remains isolated.)

If no threshold is used, the matrix can be represented as a weighted graph, with the similarity values as edge weights.

Simple Clustering Algorithms


If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering.

Clique Method (complete link)
all items within a cluster must be within the similarity threshold of all other items in that cluster
clusters may overlap
generally produces small but very tight clusters

Single Link Method
any item in a cluster must be within the similarity threshold of at least one other item in that cluster
produces larger but weaker clusters

Simple Clustering Algorithms


Clique Method
a clique is a completely connected subgraph of a graph
in the clique method, each maximal clique in the graph becomes a cluster

The maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal. A code sketch of this method is shown below.
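A sketch of the clique method using the networkx library (assumed to be installed); the edge list encodes exactly the pairs whose similarity is at least 10 in the running example.

```python
import networkx as nx

# Edges of the thresholded similarity graph (similarity >= 10)
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")                        # T7 has no edge above the threshold

# Each maximal clique becomes a (possibly overlapping) cluster
clusters = [sorted(c) for c in nx.find_cliques(G)]
print(clusters)
# e.g. [['T1','T3','T4','T6'], ['T1','T5'], ['T2','T4','T6'], ['T2','T6','T8'], ['T7']]
```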

Simple Clustering Algorithms


Single Link Method
1. Select an item not yet in a cluster and place it in a new cluster
2. Place all other items similar to it in that cluster
3. Repeat step 2 for each item in the cluster until nothing more can be added
4. Repeat steps 1-3 for each item that remains unclustered

In this case the single link method produces only two clusters:
{T1, T2, T3, T4, T5, T6, T8}
{T7}

Note that the single link method does not allow overlapping clusters, and thus partitions the set of items. See the sketch below.
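Under this graph view, single-link clusters are simply the connected components of the threshold graph; a short sketch (again using networkx, with the same edge list as above) follows.

```python
import networkx as nx

# Same thresholded similarity graph as in the clique sketch above
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")

# Single-link clusters = connected components (non-overlapping)
print([sorted(c) for c in nx.connected_components(G)])
# [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]
```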

K-Means (A partitioning clustering Method)


In machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Centroid: the middle (mean point) of a cluster, Cm = (sum over the N points tmi in cluster m of tmi) / N

Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances to the cluster centroids is minimized:

E = sum over m = 1..k of [ sum over points tmi in cluster Km of (Cm - tmi)^2 ]

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

Heuristic methods: the k-means and k-medoids algorithms
k-means: each cluster is represented by the center of the cluster
k-medoids: each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when no new assignments are made

A minimal code sketch of these steps is shown below.
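A minimal sketch using scikit-learn's KMeans (assuming scikit-learn is installed); the toy points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups
X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8, 9], [9, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each object
print(km.cluster_centers_)   # the cluster means (centroids)
```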

The K-Means Clustering Method

(Figure: k-means iterations with K = 2. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign; the update and reassignment steps repeat until the assignments no longer change.)

The K-Means Clustering Method: An Example

Document Clustering

Initial set with k = 3: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}

Document-term matrix and cluster means:

      D1   D2   D3   D4   D5   D6   D7   D8  |  C1    C2    C3
T1     0    4    0    0    0    2    1    3  | 4/2   0/2   2/2
T2     3    1    4    3    1    2    0    1  | 4/2   7/2   3/2
T3     3    0    0    0    3    0    3    0  | 3/2   0/2   3/2
T4     0    1    0    3    0    0    2    0  | 1/2   3/2   0/2
T5     2    2    2    3    1    4    0    2  | 4/2   5/2   5/2

Example: K-Means
Now compute the similarity (or distance) of each document to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

       C1     C2     C3
D1   29/2   31/2   28/2
D2   29/2   20/2   21/2
D3   24/2   38/2   22/2
D4   27/2   45/2   24/2
D5   17/2   12/2   17/2
D6   32/2   34/2   30/2
D7   15/2    6/2   11/2
D8   24/2   17/2   19/2

For each document, reallocate the document to the cluster to which it has the highest similarity (highlighted in the original slide). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation. A sketch of this computation is given below.
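The cluster-document similarities above can be reproduced as follows (a NumPy sketch, not part of the slides). Note that D5 ties between C1 and C3 at 17/2; the slide resolves the tie in favour of C3.

```python
import numpy as np

# Columns = D1..D8, rows = T1..T5 (the document-term matrix from the previous slide)
A = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]], dtype=float)

centroids = np.stack([A[:, [0, 1]].mean(axis=1),   # C1 = mean of D1, D2
                      A[:, [2, 3]].mean(axis=1),   # C2 = mean of D3, D4
                      A[:, [4, 5]].mean(axis=1)])  # C3 = mean of D5, D6

sims = centroids @ A     # 3 x 8: dot-product similarity of each cluster to each document
print(sims[:, 0])        # D1: [14.5 15.5 14. ]  ->  D1 is reallocated to C2
```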

Example: K-Means
Now compute new cluster centroids using the original document-term matrix

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

      D1   D2   D3   D4   D5   D6   D7   D8  |  C1     C2    C3
T1     0    4    0    0    0    2    1    3  | 8/3    2/4   0/1
T2     3    1    4    3    1    2    0    1  | 2/3   12/4   1/1
T3     3    0    0    0    3    0    3    0  | 3/3    3/4   3/1
T4     0    1    0    3    0    0    2    0  | 3/3    3/4   0/1
T5     2    2    2    3    1    4    0    2  | 4/3   11/4   1/1

Comments on the K-Means Method


Weaknesses
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers

Variations of the K-Means Method


There are a few variants of k-means, which differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

What Is the Problem of the K-Means Method?


The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

(Figure: the same data set clustered around means vs. around medoids)

The K-Medoids Clustering Method


Find representative objects, called medoids, in clusters.

PAM (Partitioning Around Medoids) algorithm:
1. Initialize: randomly select k of the n data points as the medoids
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean, Manhattan, or Minkowski distance)
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
5. Repeat steps 2 to 5 until there is no change in the medoids

A rough code sketch is given below.
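A rough, brute-force sketch of PAM (our own illustrative implementation with Euclidean distance and no optimizations, not the canonical one):

```python
import numpy as np

def pam(X, k, n_iter=100):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)    # pairwise distances
    medoids = list(np.random.choice(n, k, replace=False))   # 1. random initialization
    for _ in range(n_iter):
        best = (D[:, medoids].min(axis=1).sum(), medoids)   # cost of current configuration
        for i in range(k):                                   # 3. try every medoid /
            for o in range(n):                               #    non-medoid swap
                if o in medoids:
                    continue
                cand = medoids[:i] + [o] + medoids[i + 1:]
                cost = D[:, cand].min(axis=1).sum()          # total configuration cost
                if cost < best[0]:
                    best = (cost, cand)                      # 4. keep the cheapest swap
        if best[1] == medoids:                               # 5. stop when nothing changes
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)                    # 2. assign to closest medoid
    return medoids, labels

np.random.seed(0)
X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [8, 9], [9, 8.5]])
print(pam(X, k=2))   # one medoid (an actual data point) per group, plus the labels
```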

Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

(Figure: agglomerative clustering (AGNES, Agglomerative Nesting) merges objects a, b, c, d, e bottom-up over steps 0-4 into a single cluster abcde, while divisive clustering (DIANA, Divisive Analysis) splits abcde top-down over the same steps.)

Agglomerative hierarchical clustering


Group data objects in a bottom-up fashion.

Initially each data object is in its own cluster.

Then we merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. A user can specify the desired number of clusters as a termination condition. A short sketch follows below.
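A short sketch of agglomerative clustering with SciPy (assumed available); the points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8, 9], [9, 8.5]])

Z = linkage(X, method="average")                   # bottom-up merge history
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(labels)                                      # e.g. [1 1 1 2 2 2]
# dendrogram(Z) draws the merge tree shown on the later slide
```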

Divisive hierarchical clustering

Groups data objects in a top-down fashion.

Initially all data objects are in one cluster. We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained.

AGNES (Agglomerative Nesting)

(Figure: AGNES merging nearby points into progressively larger clusters over three stages)

Dendrogram: Shows How the Clusters are Merged


Density-Based Clustering Methods


Clustering based on density (a local cluster criterion), such as density-connected points.

Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as a termination condition

Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) & OPTICS (Ordering Points To Identify the Clustering Structure)
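A minimal DBSCAN sketch with scikit-learn (assumed installed); eps and min_samples are the density parameters, and the data points are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.1, 8.2],
              [25.0, 80.0]])                 # an isolated point (noise)

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)                            # e.g. [ 0  0  0  1  1 -1]; -1 marks noise
```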

GRID-BASED CLUSTERING METHODS

This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.

Assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters within it.

(Figure: a Salary (x 10,000) vs. Age plane divided into grid cells; our space is this plane)

STING: A Statistical Information Grid Approach


The spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution.

Different Grid Levels during Query Processing.


The STING Clustering Method


Each cell at a high level is partitioned into a number of smaller cells at the next lower level.

Statistical information for each cell is calculated and stored beforehand and is used to answer queries.

Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
count, mean, s (standard deviation), min, max
type of distribution (normal, uniform, etc.)

Use a top-down approach to answer spatial data queries:
Start from a pre-selected layer, typically with a small number of cells
For each cell in the current level, compute the confidence interval

Comments on STING
Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Repeat this process until the bottom layer is reached.

Advantages:
Query-independent, easy to parallelize, incremental update

Disadvantages:
All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected

Model-Based Clustering
What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical model
Based on the assumption that data are generated by a mixture of underlying probability distributions

Typical methods
Statistical approach: EM (Expectation Maximization), AutoClass
Machine learning approach: COBWEB, CLASSIT
Neural network approach: SOM (Self-Organizing Feature Map)

An illustrative sketch of the statistical (EM) approach is shown below.
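A minimal sketch with scikit-learn's GaussianMixture (assumed installed), which fits a mixture of Gaussians with EM and yields both hard and soft cluster memberships; the points are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8, 9], [9, 8.5]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)   # fitted with EM
print(gm.predict(X))         # hard cluster labels
print(gm.predict_proba(X))   # soft (probabilistic) memberships
```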

Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)

COBWEB
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering Method


A classification tree


What Is Outlier Discovery?


What are outliers?
A set of objects that are considerably dissimilar from the remainder of the data

Problem: define and find outliers in large data sets

Applications:
Credit card fraud detection
Customer segmentation
Medical analysis


UCI / Weka:
1. KMeans
2. DBScan
3. OPTICS
4. EM
