

jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Data Mining

Clustering


What is Cluster Analysis?


Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters

Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes

Clustering: Applications
Pattern Recognition

Image Processing
Economic Science (especially market research)

WWW
Document classification
Clustering Weblog data to discover groups of similar access patterns

Examples of Clustering Applications


Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs


Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location

Quality: What Is Good Clustering?


A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Measure the Quality of Clustering


Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, typically a metric: d(i, j)


There is a separate quality function that measures the goodness of a cluster.
The definitions of distance functions are usually very different for different kinds of variables.

It is hard to define "similar enough" or "good enough"


the answer is typically highly subjective.

Requirements of Clustering in Machine Learning


Scalability

Ability to deal with different types of attributes


Ability to handle dynamic data
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
High dimensionality
Interpretability and usability

Similarity and Dissimilarity Between Objects


There are several methods to measure similarity (distance) between objects (see the KNN slides).

Dissimilarity between Binary Variables


A contingency table for binary data

                  Object j
                  1        0        sum
Object i    1     a        b        a + b
            0     c        d        c + d
           sum   a + c    b + d      p

Distance measure for symmetric binary variables:
d(i, j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables:
d(i, j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables


Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     M        Y       N       P        N        N        N
Mary     F        Y       N       P        N        P        N
Jim      M        Y       P       N        N        N        N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary (Y and P are coded as 1, N as 0).

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
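The calculation above can be checked with a short Python sketch (not part of the original slides). Only the asymmetric attributes (Fever, Cough, Test-1 to Test-4) are used, with Y/P coded as 1 and N as 0; the helper function name is ours.

```python
# Sketch: asymmetric binary dissimilarity, d(i, j) = (b + c) / (a + b + c)

def asym_binary_dissim(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # 1-1 matches
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # i has it, j does not
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # j has it, i does not
    return (b + c) / (a + b + c)        # 0-0 matches are ignored (asymmetric case)

# Asymmetric attributes only (gender is symmetric and is left out):
# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P = 1 and N = 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```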

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = (p - m) / p

Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
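A minimal sketch of Method 1 (simple matching); the example attribute values are made up for illustration.

```python
# Sketch: simple-matching dissimilarity for nominal variables, d(i, j) = (p - m) / p

def nominal_dissim(x, y):
    p = len(x)                                       # total number of variables
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)   # number of matching states
    return (p - m) / p

# Two objects described by three nominal attributes (colour, size, shape)
print(nominal_dissim(["red", "small", "round"],
                     ["red", "large", "square"]))    # (3 - 1) / 3 = 0.67 (approx.)
```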

Major Clustering Approaches (I)


Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Typical methods: graph partitioning, k-means, k-medoids

Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK

Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS

Major Clustering Approaches (II)


Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE

Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
Typical methods: EM, SOM, COBWEB

Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster

An Example: Graph Partitioning(1)


Sessions:
S1: a, c, b, d, c
S2: b, d, e, a
S3: e, c
[...]
Sn: b, e, d

Build the sessions co-occurrence matrix M, then create an undirected graph based on matrix M.

An Example: Graph Partitioning(2)

(Figure: the co-occurrence matrix, the corresponding undirected graph, and the resulting clusters)

Clusters:
C1: a, b, e
C2: c, f, g, i
C3: d

Small clusters are filtered out (MinClusterSize)

An Example: Graph Partitioning(3)


Distance (Similarity) Matrix for Graph-based Clustering


Similarity (Distance) Matrix
Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values.
The (i, j) entry in the matrix is the distance (similarity) between items i and j: dij = similarity (or distance) of Di to Dj
Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1s (similarity) or all 0s (distance).

Example: Term Similarities in Documents


Suppose we want to cluster terms that appear in a collection of documents with different frequencies
Each term can be viewed as a vector of term frequencies (weights)
      Doc1   Doc2   Doc3   Doc4   Doc5
T1      0      3      3      0      2
T2      4      1      0      1      2
T3      0      4      0      0      2
T4      0      3      0      3      3
T5      0      1      3      0      1
T6      2      2      0      0      4
T7      1      0      3      2      0
T8      3      1      0      0      2

We need to compute a term-term similarity matrix.

For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):

sim(Ti, Tj) = sum over k = 1..N of (wik x wjk)

where N = total number of dimensions (in this case, documents) and wik = weight of term i in document k.

Example:
Sim(T1, T2) = <0,3,3,0,2> . <4,1,0,1,2> = 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
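The full term-term similarity matrix can be reproduced with a single matrix product. The sketch below assumes NumPy is available and uses the term weights from the table above; the last line also shows the thresholding used on the later slides.

```python
import numpy as np

# Rows = terms T1..T8, columns = Doc1..Doc5 (weights from the table above)
W = np.array([[0, 3, 3, 0, 2],    # T1
              [4, 1, 0, 1, 2],    # T2
              [0, 4, 0, 0, 2],    # T3
              [0, 3, 0, 3, 3],    # T4
              [0, 1, 3, 0, 1],    # T5
              [2, 2, 0, 0, 4],    # T6
              [1, 0, 3, 2, 0],    # T7
              [3, 1, 0, 0, 2]])   # T8

S = W @ W.T                       # S[i, j] = sum_k w_ik * w_jk (dot product)
print(S[0, 1])                    # Sim(T1, T2) = 7
print((S >= 10).astype(int))      # thresholded version used on the next slides
```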

Example: Term Similarities in Documents


Using the same term-document matrix as on the previous slide and sim(Ti, Tj) = sum over k = 1..N of (wik x wjk), we obtain the term-term similarity matrix.

Term-Term Similarity Matrix (upper triangle shown; the matrix is symmetric):

      T2    T3    T4    T5    T6    T7    T8
T1     7    16    15    14    14     9     7
T2           8    12     3    18     6    17
T3                18     6    16     0     8
T4                       6    18     6     9
T5                             6     9     3
T6                                   2    16
T7                                         3

Similarity (Distance) Thresholds


A similarity (distance) threshold may be used to mark pairs that are sufficiently similar.

Using a threshold value of 10 on the term-term similarity matrix of the previous example:

      T2    T3    T4    T5    T6    T7    T8
T1     7    16    15    14    14     9     7
T2           8    12     3    18     6    17
T3                18     6    16     0     8
T4                       6    18     6     9
T5                             6     9     3
T6                                   2    16
T7                                         3

gives the binary threshold matrix (1 = similarity >= 10):

      T2    T3    T4    T5    T6    T7    T8
T1     0     1     1     1     1     0     0
T2           0     1     0     1     0     1
T3                 1     0     1     0     0
T4                       0     1     0     0
T5                             0     0     0
T6                                   0     1
T7                                         0

Graph Representation
The similarity matrix can be visualized as an undirected graph
each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)
(Figure: the graph induced by the threshold matrix above; nodes T1 through T8, with an edge between each pair whose entry is 1. T7 remains isolated.)

If no threshold is used, the matrix can be represented as a weighted graph, with the similarity values as edge weights.

Simple Clustering Algorithms


If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering.

Clique Method (complete link)
all items within a cluster must be within the similarity threshold of all other items in that cluster
clusters may overlap
generally produces small but very tight clusters

Single Link Method
any item in a cluster must be within the similarity threshold of at least one other item in that cluster
produces larger but weaker clusters

Simple Clustering Algorithms


Clique Method
a clique is a completely connected subgraph of a graph
in the clique method, each maximal clique in the graph becomes a cluster

The maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal. A code sketch of this method is shown below.
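A sketch of the clique method using the networkx library (assumed to be installed); the edge list encodes exactly the pairs whose similarity is at least 10 in the running example.

```python
import networkx as nx

# Edges of the thresholded similarity graph (similarity >= 10)
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")                        # T7 has no edge above the threshold

# Each maximal clique becomes a (possibly overlapping) cluster
clusters = [sorted(c) for c in nx.find_cliques(G)]
print(clusters)
# e.g. [['T1','T3','T4','T6'], ['T1','T5'], ['T2','T4','T6'], ['T2','T6','T8'], ['T7']]
```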

Simple Clustering Algorithms


Single Link Method
1. Select an item not yet in a cluster and place it in a new cluster
2. Place all other items similar to it in that cluster
3. Repeat step 2 for each item in the cluster until nothing more can be added
4. Repeat steps 1-3 for each item that remains unclustered

In this case the single link method produces only two clusters:
{T1, T2, T3, T4, T5, T6, T8}
{T7}

Note that the single link method does not allow overlapping clusters, and thus partitions the set of items. See the sketch below.
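Under this graph view, single-link clusters are simply the connected components of the threshold graph; a short sketch (again using networkx, with the same edge list as above) follows.

```python
import networkx as nx

# Same thresholded similarity graph as in the clique sketch above
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")

# Single-link clusters = connected components (non-overlapping)
print([sorted(c) for c in nx.connected_components(G)])
# [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]
```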

K-Means (A partitioning clustering Method)


In machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Centroid: the middle (mean point) of a cluster, Cm = (sum over the N points tmi in cluster m of tmi) / N

Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances to the cluster centroids is minimized:

E = sum over m = 1..k of [ sum over points tmi in cluster Km of (Cm - tmi)^2 ]

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

Heuristic methods: the k-means and k-medoids algorithms
k-means: each cluster is represented by the center of the cluster
k-medoids: each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to step 2; stop when no new assignments are made

A minimal code sketch of these steps is shown below.
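A minimal sketch using scikit-learn's KMeans (assuming scikit-learn is installed); the toy points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups
X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8, 9], [9, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment of each object
print(km.cluster_centers_)   # the cluster means (centroids)
```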

The K-Means Clustering Method

(Figure: k-means iterations with K = 2. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign; the update and reassignment steps repeat until the assignments no longer change.)

The K-Means Clustering Method: An Example

Document Clustering

Initial set with k = 3: C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}

Document-term matrix and cluster means:

      D1   D2   D3   D4   D5   D6   D7   D8  |  C1    C2    C3
T1     0    4    0    0    0    2    1    3  | 4/2   0/2   2/2
T2     3    1    4    3    1    2    0    1  | 4/2   7/2   3/2
T3     3    0    0    0    3    0    3    0  | 3/2   0/2   3/2
T4     0    1    0    3    0    0    2    0  | 1/2   3/2   0/2
T5     2    2    2    3    1    4    0    2  | 4/2   5/2   5/2

Example: K-Means
Now compute the similarity (or distance) of each document to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

       C1     C2     C3
D1   29/2   31/2   28/2
D2   29/2   20/2   21/2
D3   24/2   38/2   22/2
D4   27/2   45/2   24/2
D5   17/2   12/2   17/2
D6   32/2   34/2   30/2
D7   15/2    6/2   11/2
D8   24/2   17/2   19/2

For each document, reallocate the document to the cluster to which it has the highest similarity (highlighted in the original slide). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation. A sketch of this computation is given below.
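The cluster-document similarities above can be reproduced as follows (a NumPy sketch, not part of the slides). Note that D5 ties between C1 and C3 at 17/2; the slide resolves the tie in favour of C3.

```python
import numpy as np

# Columns = D1..D8, rows = T1..T5 (the document-term matrix from the previous slide)
A = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]], dtype=float)

centroids = np.stack([A[:, [0, 1]].mean(axis=1),   # C1 = mean of D1, D2
                      A[:, [2, 3]].mean(axis=1),   # C2 = mean of D3, D4
                      A[:, [4, 5]].mean(axis=1)])  # C3 = mean of D5, D6

sims = centroids @ A     # 3 x 8: dot-product similarity of each cluster to each document
print(sims[:, 0])        # D1: [14.5 15.5 14. ]  ->  D1 is reallocated to C2
```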

Example: K-Means
Now compute new cluster centroids using the original document-term matrix

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

      D1   D2   D3   D4   D5   D6   D7   D8  |  C1     C2    C3
T1     0    4    0    0    0    2    1    3  | 8/3    2/4   0/1
T2     3    1    4    3    1    2    0    1  | 2/3   12/4   1/1
T3     3    0    0    0    3    0    3    0  | 3/3    3/4   3/1
T4     0    1    0    3    0    0    2    0  | 3/3    3/4   0/1
T5     2    2    2    3    1    4    0    2  | 4/3   11/4   1/1

Comments on the K-Means Method


Weaknesses
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers

Variations of the K-Means Method


There are a few variants of k-means, which differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means

What Is the Problem of the K-Means Method?


The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

(Figure: the same data set clustered around means vs. around medoids)

The K-Medoids Clustering Method


Find representative objects, called medoids, in clusters.

PAM (Partitioning Around Medoids) algorithm:
1. Initialize: randomly select k of the n data points as the medoids
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean, Manhattan, or Minkowski distance)
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
5. Repeat steps 2 to 5 until there is no change in the medoids

A rough code sketch is given below.
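A rough, brute-force sketch of PAM (our own illustrative implementation with Euclidean distance and no optimizations, not the canonical one):

```python
import numpy as np

def pam(X, k, n_iter=100):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)    # pairwise distances
    medoids = list(np.random.choice(n, k, replace=False))   # 1. random initialization
    for _ in range(n_iter):
        best = (D[:, medoids].min(axis=1).sum(), medoids)   # cost of current configuration
        for i in range(k):                                   # 3. try every medoid /
            for o in range(n):                               #    non-medoid swap
                if o in medoids:
                    continue
                cand = medoids[:i] + [o] + medoids[i + 1:]
                cost = D[:, cand].min(axis=1).sum()          # total configuration cost
                if cost < best[0]:
                    best = (cost, cand)                      # 4. keep the cheapest swap
        if best[1] == medoids:                               # 5. stop when nothing changes
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)                    # 2. assign to closest medoid
    return medoids, labels

np.random.seed(0)
X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [8, 9], [9, 8.5]])
print(pam(X, k=2))   # one medoid (an actual data point) per group, plus the labels
```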

Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

(Figure: agglomerative clustering (AGNES, Agglomerative Nesting) merges objects a, b, c, d, e bottom-up over steps 0-4 into a single cluster abcde, while divisive clustering (DIANA, Divisive Analysis) splits abcde top-down over the same steps.)

Agglomerative hierarchical clustering


Group data objects in a bottom-up fashion.

Initially each data object is in its own cluster.

Then we merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. A user can specify the desired number of clusters as a termination condition. A short sketch follows below.
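A short sketch of agglomerative clustering with SciPy (assumed available); the points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8, 9], [9, 8.5]])

Z = linkage(X, method="average")                   # bottom-up merge history
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(labels)                                      # e.g. [1 1 1 2 2 2]
# dendrogram(Z) draws the merge tree shown on the later slide
```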

Divisive hierarchical clustering

Groups data objects in a top-down fashion.

Initially all data objects are in one cluster. We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained.

AGNES (Agglomerative Nesting)

(Figure: AGNES merging nearby points into progressively larger clusters over three stages)

Dendrogram: Shows How the Clusters are Merged


Density-Based Clustering Methods


Clustering based on density (a local cluster criterion), such as density-connected points.

Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as a termination condition

Density-Based Clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) & OPTICS (Ordering Points To Identify the Clustering Structure)
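A minimal DBSCAN sketch with scikit-learn (assumed installed); eps and min_samples are the density parameters, and the data points are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.1, 8.2],
              [25.0, 80.0]])                 # an isolated point (noise)

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)                            # e.g. [ 0  0  0  1  1 -1]; -1 marks noise
```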

GRID-BASED CLUSTERING METHODS

This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.

Assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters within it.

(Figure: a Salary (x 10,000) vs. Age plane divided into grid cells; our space is this plane)

STING: A Statistical Information Grid Approach


The spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution.

Different Grid Levels during Query Processing.


The STING Clustering Method


Each cell at a high level is partitioned into a number of smaller cells at the next lower level.

Statistical information for each cell is calculated and stored beforehand and is used to answer queries.

Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
count, mean, s (standard deviation), min, max
type of distribution (normal, uniform, etc.)

Use a top-down approach to answer spatial data queries:
Start from a pre-selected layer, typically with a small number of cells
For each cell in the current level, compute the confidence interval

Comments on STING
Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Repeat this process until the bottom layer is reached.

Advantages:
Query-independent, easy to parallelize, incremental update

Disadvantages:
All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected

Model-Based Clustering
What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical model
Based on the assumption that data are generated by a mixture of underlying probability distributions

Typical methods
Statistical approach: EM (Expectation Maximization), AutoClass
Machine learning approach: COBWEB, CLASSIT
Neural network approach: SOM (Self-Organizing Feature Map)

An illustrative sketch of the statistical (EM) approach is shown below.
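A minimal sketch with scikit-learn's GaussianMixture (assumed installed), which fits a mixture of Gaussians with EM and yields both hard and soft cluster memberships; the points are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8, 9], [9, 8.5]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)   # fitted with EM
print(gm.predict(X))         # hard cluster labels
print(gm.predict_proba(X))   # soft (probabilistic) memberships
```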

Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)

COBWEB
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept

COBWEB Clustering Method


A classification tree


What Is Outlier Discovery?


What are outliers?
A set of objects that are considerably dissimilar from the remainder of the data

Problem: define and find outliers in large data sets

Applications:
Credit card fraud detection
Customer segmentation
Medical analysis


UCI / Weka:
1. KMeans
2. DBScan
3. OPTICS
4. EM
