Welcome to: Clustering Techniques
Unit objectives IBM ICE (Innovation Centre for Education)
IBM Power Systems
• A form of unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given.
• A common and important task that finds many applications in IR and other places.
– Documents within a cluster should be similar.
– Documents from different clusters should be dissimilar.
• It is therefore a method of data exploration: a way of looking for patterns or structure of interest in the data.
Clustering algorithms
• Most of the methods tend to use the inherent structures in the data to best organize the data
into groups, considering the maximum similarity in the data.
(Figure: sample points plotted against Variable 1 and Variable 2, showing natural groupings.)
Statistics associated with cluster analysis
• Cluster centroid: Mean values of the variables for all the cases in a particular cluster.
• Cluster centers: Initial starting points in nonhierarchical clustering. Clusters are built around
these centers or seeds.
• Cluster membership: Indicates the cluster to which each object or case belongs.
• Distances between cluster centers: These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct and therefore desirable.
• Pattern recognition.
• Image processing.
• The Web (WWW):
– Document classification.
– Cluster Weblog data to discover groups of similar access patterns.
Clustering as a preprocessing tool
• Compression
– Image processing: Vector quantization
• Outlier detection
– Outliers are often viewed as those “far away” from any cluster.
Hard vs. soft clustering
• Distances are normally used to measure the similarity or dissimilarity between two data
objects.
d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)
– where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
• If q = 1, d is the Manhattan distance:
d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
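As a sketch, the Minkowski distance can be computed directly from its definition (illustrative Python; the function name is an assumption, not from the slides):

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
```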
Type of data in clustering analysis
• For binary variables, the dissimilarity between objects i and j can be computed from a 2 x 2 contingency table:

              Object j
                1       0       sum
Object i  1     a       b       a + b
          0     c       d       c + d
        sum   a + c   b + d       p
• Nominal variables: A generalization of the binary variable in that a variable can take more than two states, e.g., red, yellow, blue, green.
• A simple matching dissimilarity: d(i, j) = (p - m) / p, where m is the number of variables on which i and j match and p is the total number of variables.
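A minimal sketch of this simple matching dissimilarity (illustrative Python; the function name is an assumption):

```python
def nominal_dissimilarity(x, y):
    """d(i, j) = (p - m) / p for two objects described by nominal variables."""
    p = len(x)                                  # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)  # number of matching states
    return (p - m) / p
```

For example, two objects agreeing on one of two nominal variables get dissimilarity 0.5.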
Ordinal variables
• An ordinal variable f for object i takes a rank r_if in {1, ..., M_f}.
– Map each rank onto [0, 1] by z_if = (r_if - 1) / (M_f - 1).
– Compute the dissimilarity using methods for interval-scaled variables on the z_if values.
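The rank-to-interval mapping above can be sketched as (illustrative Python):

```python
def ordinal_to_interval(rank, M):
    """Map a rank r_if in {1, ..., M_f} onto [0, 1] via (r_if - 1) / (M_f - 1)."""
    return (rank - 1) / (M - 1)
```

With M_f = 5 states, ranks 1, 3, 5 map to 0.0, 0.5, 1.0.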
Major clustering approaches
• Cluster centroid: The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster.
• Distance: Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...)
Hierarchical clustering
(Figure: a nested clustering of six points and the corresponding dendrogram; the vertical axis shows the merge height.)
Hierarchical Agglomerative Clustering (HAC)
• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The single linkage method is based on minimum distance, or the nearest neighbor rule.
• The complete linkage method is based on the maximum distance or the furthest neighbor
approach.
• In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster.
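The agglomerative procedure with a pluggable linkage can be sketched as follows (illustrative pure Python on 1-D points; min gives single linkage, max complete linkage):

```python
def hac(points, linkage=min, dist=lambda a, b: abs(a - b)):
    """Repeatedly merge the two closest clusters (under the given linkage)
    until one cluster remains; returns the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # linkage distance = aggregate of all pairwise distances
                d = linkage([dist(a, b) for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

Average linkage drops in as linkage=lambda ds: sum(ds) / len(ds).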
Hierarchical Agglomerative Clustering: Variance and centroid method
• Ward's procedure is commonly used. For each cluster, the within-cluster sum of squares is calculated; the two clusters whose merger gives the smallest increase in the overall within-cluster sum of squares are combined.
• In the centroid methods, the distance between two clusters is the distance between their
centroids (means for all the variables).
• Of the hierarchical methods, average linkage and Ward's methods have been shown to
perform better than the other procedures.
Cluster distance measures
• Single link: Smallest distance between an element in one cluster and an element in the
other, i.e., d(Ci, Cj) = min{d(xip, xjq)}.
• Complete link: Largest distance between an element in one cluster and an element in the
other, i.e., d(Ci, Cj) = max{d(xip, xjq)}.
• Average: Average distance between elements in one cluster and elements in the other, i.e.,
d(Ci, Cj) = avg{d(xip, xjq)}
(Figure: the three cluster distance measures illustrated: single link (min), complete link (max), and average.)
Single link agglomerative clustering
• Single-link distance between clusters Ci and Cj is the minimum of all pairwise distances between points in the two clusters.
• Complete-link distance between clusters Ci and Cj is the maximum distance between any
object in Ci and any object in Cj.
Dcl(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
• Example: Distance between clusters is determined by the two most distant points in the
different clusters.
      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
Average-link clustering
• Group average distance between clusters Ci and Cj is the average distance between any
object in Ci and any object in Cj.
Davg(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{x ∈ Ci, y ∈ Cj} d(x, y)
• Example: Proximity of two clusters is the average of pairwise proximity between points in the
two clusters.
      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00
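Using the similarity values above, the complete-link and average-link quantities between two hypothetical clusters, say {I1, I2} and {I3, I4, I5}, can be computed directly (illustrative Python; the cluster split is an assumption, and with similarities the complete link takes the least similar pair):

```python
# Similarity matrix from the slides; row/column k corresponds to I(k+1).
S = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def complete_link_sim(ci, cj, S):
    """Similarity of the least similar pair (analogue of the max distance)."""
    return min(S[i][j] for i in ci for j in cj)

def average_link_sim(ci, cj, S):
    """Mean pairwise similarity between the two clusters."""
    pairs = [S[i][j] for i in ci for j in cj]
    return sum(pairs) / len(pairs)

ci, cj = [0, 1], [2, 3, 4]   # hypothetical clusters {I1, I2} and {I3, I4, I5}
```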
Other agglomerative clustering methods
• Ward’s procedure.
• Centroid method.
Distance between two clusters
• Centroid distance between clusters Ci and Cj is the distance between the centroid ri of Ci
and the centroid rj of Cj.
• Ward’s distance between clusters Ci and Cj is the difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters into cluster Cij.
– ri: centroid of Ci.
– rj: centroid of Cj.
– rij: centroid of Cij.
• The K-means algorithm works iteratively to assign each data point to one of K groups.
• In order to divide the data into groups, it uses the features of the data objects.
• Data points are clustered based on a feature match score, a measure of similarity or dissimilarity.
• The output of the K-means algorithm is the K clusters and the centroid of each cluster.
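A minimal K-means sketch on 2-D points (illustrative Python; the fixed iteration count and random initial centroids are simplifying assumptions):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Assign each point to its nearest centroid, recompute centroids
    as cluster means, and repeat for a fixed number of rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [(sum(q[0] for q in g) / len(g), sum(q[1] for q in g) / len(g))
                     if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups
```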
Importance of choosing initial centroids
(Figure: successive K-means iterations from two different choices of initial centroids, plotted in the x-y plane.)
The K-medoids clustering method
• CLARA draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output.
• Strength
– Deals with larger data sets than PAM.
• Weakness
– Efficiency depends on the sample size.
– A good clustering based on samples will not necessarily represent a good clustering of the whole data
set if the sample is biased.
CLARANS (Randomized CLARA)
• The clustering process can be presented as searching a graph where every node is a
potential solution, that is, a set of k medoids.
• If the local optimum is found, CLARANS starts with new randomly selected node in search
for a new local optimum.
• Focusing techniques and spatial access structures may further improve its performance
(Ester et al.’95).
Density based clustering methods
• Major features:
– Discover clusters of arbitrary shape.
– Handle noise.
– One scan.
– Need density parameters as termination condition.
(Figure: DBSCAN point types: a core point, a border point, and an outlier, with Eps = 1 cm and MinPts = 5.)
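A compact DBSCAN sketch using Eps and MinPts (illustrative pure Python, not an official implementation; label -1 marks outliers):

```python
def region_query(points, p, eps):
    """Indices of all points within Eps of p (including p itself)."""
    return [i for i, q in enumerate(points)
            if (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise/outliers."""
    labels = [None] * len(points)       # None = unvisited
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, points[i], eps)
        if len(neighbors) < min_pts:
            labels[i] = -1              # not a core point: tentatively noise
            continue
        labels[i] = cid                 # new cluster seeded by this core point
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid         # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = region_query(points, points[j], eps)
            if len(jn) >= min_pts:      # j is a core point too: keep expanding
                seeds.extend(jn)
        cid += 1
    return labels
```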
When DBSCAN does not work well
• Varying densities.
• High-dimensional data.
(Figure: the original points and DBSCAN results with MinPts = 4 at Eps = 9.75 and Eps = 9.92.)
External criteria for clustering quality
• Quality measured by its ability to discover some or all of the hidden patterns or latent classes
in gold standard data.
• Assume documents with C gold standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members.
• Simple measure: purity, the ratio between the number of members of the dominant class in cluster ωi and the size ni of cluster ωi:
Purity(ωi) = (1/ni) max_j (nij),  j = 1, ..., C
• Others are entropy of classes in clusters (or mutual information between classes and
clusters).
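Overall purity, the size-weighted average of the per-cluster measure above, can be sketched as (illustrative Python):

```python
from collections import Counter

def purity(clusters, classes):
    """clusters and classes are parallel lists: the assigned cluster id and
    gold-standard class of each document. Sums the dominant-class counts."""
    correct = 0
    for c in set(clusters):
        members = [g for k, g in zip(clusters, classes) if k == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(classes)
```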
Different aspects of cluster validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
– Use only the data.
4. Comparing the results of two different sets of cluster analyses to determine which is better.
• For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of cluster validity
• Numerical measures applied to judge various aspects of cluster validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels match externally supplied class
labels.
• Entropy.
– Internal Index: Used to measure the goodness of a clustering structure without respect to external
information.
• Sum of Squared Error (SSE).
– Relative Index: Used to compare two different clusterings or clusters.
• Often an external or internal index is used for this function, e.g., SSE or entropy.
• Two matrices:
– Proximity matrix.
– “Incidence” matrix:
• One row and one column for each data point.
• An entry is 1 if the associated pair of points belongs to the same cluster.
• An entry is 0 if the associated pair of points belongs to different clusters.
• Compute the correlation between the two matrices; high correlation indicates that points that belong to the same cluster are close to each other.
• Order the similarity matrix with respect to cluster labels and inspect visually.
(Figure: points in the x-y plane and the corresponding similarity matrix, ordered by cluster label, with similarity values ranging from 0 to 1.)
Internal measures: SSE
• Internal Index: Used to measure the goodness of a clustering structure without respect to
external information.
– SSE.
• SSE is good for comparing two clusterings or two clusters (average SSE).
(Figure: an example data set and its SSE plotted against the number of clusters K, for K from 2 to 30.)
Framework for cluster validity
• For comparing the results of two different sets of cluster analyses, a framework is less
necessary.
– However, there is the question of whether the difference between two index values is significant.
Internal measures: Cohesion and separation
• Cluster cohesion: Measures how closely related the objects within a cluster are.
• Cluster separation: Measures how distinct or well-separated a cluster is from other clusters.
– Example: Squared error.
– Cohesion is measured by the within-cluster sum of squares (WSS).
– Separation is measured by the between-cluster sum of squares (BSS).
WSS = Σ_i Σ_{x ∈ Ci} (x - mi)^2
BSS = Σ_i |Ci| (m - mi)^2
– where mi is the centroid of cluster Ci and m is the overall mean of the data.
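For 1-D data, the two sums of squares can be computed directly from their definitions (illustrative Python; note that WSS + BSS equals the total sum of squares about the overall mean):

```python
def cohesion_separation(clusters):
    """clusters: list of clusters, each a list of 1-D points.
    Returns (WSS, BSS) with m_i the cluster mean and m the overall mean."""
    allpts = [x for c in clusters for x in c]
    m = sum(allpts) / len(allpts)
    wss = bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)
        wss += sum((x - mi) ** 2 for x in c)   # within-cluster sum of squares
        bss += len(c) * (m - mi) ** 2          # between-cluster sum of squares
    return wss, bss
```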
Internal measures: Silhouette coefficient
• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
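For a single point, the standard coefficient s = (b - a) / max(a, b) can be sketched as (illustrative Python; a is the mean distance to the point's own cluster, b to the nearest other cluster):

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """own_cluster: the other members of the point's cluster;
    other_clusters: a list of the remaining clusters."""
    a = sum(dist(point, q) for q in own_cluster) / len(own_cluster)
    b = min(sum(dist(point, q) for q in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)
```

Values near 1 indicate a well-placed point; values near 0 or below suggest the point sits between clusters.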