
Data Warehousing and Data Mining Unit 12

Unit 12 Clustering
Structure:
12.1 Introduction
Objectives
12.2 Clustering
12.3 Cluster Analysis
12.4 Clustering Methods
K-means
Hierarchical clustering
Agglomerative clustering
Divisive clustering
12.5 Clustering and Segmentation Software
12.6 Evaluating Clusters
12.7 Summary
12.8 Terminal Questions
12.9 Answers

12.1 Introduction
In this unit you will be familiarized with clustering. Clustering and the
nearest-neighbor prediction technique are among the oldest techniques
used in data mining. Clustering is the process of grouping similar records
together. Nearest neighbor is a prediction technique that is quite similar to
clustering: to predict the value for one record, look for records with similar
predictor values in the historical database and use the prediction value
from the record that is nearest to the unclassified record.
Objectives:
After studying this unit, you should be able to:
discuss cluster analysis and clustering methods
describe the K-means algorithm
explain the agglomerative algorithm
explain the divisive algorithm
describe clustering and segmentation software

12.2 Clustering
Clustering is the method by which like records are grouped together.
Usually this is done to give the end user a high-level view of what is going
on in the database. Clustering is sometimes used to mean segmentation,
which most marketing people will tell you is useful for getting a bird's-eye
view of the business. Clustering is a form of learning by observation
rather than learning by examples.
Clustering is the classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters), so that
the data in each subset (ideally) share some common characteristics.
These clusters can be used to help you understand the business better and
can also be exploited to improve future performance through predictive
analytics. For example, data mining can warn you that there is a high
probability that a specific customer will not pay on time. This is based on an
analysis of customers with similar characteristics.
Cluster Representation

Fig. 12.1: Cluster Representation

In this case we easily identify the 4 clusters into which the data can be
divided. The similarity criterion is distance: two or more objects belong to the
same cluster if they are close according to a given distance (in this case
geometrical distance). This is called distance-based clustering. Another kind
of clustering is conceptual clustering: two or more objects belong to the same
cluster if the cluster defines a concept common to all of those objects. In other
words, objects are grouped according to their fit to descriptive concepts, not
according to simple similarity measures.

A simple example of clustering would be the clustering that most people
perform when they do the laundry: grouping the permanent press, dry
cleaning, whites and brightly colored clothes is important because they have
similar characteristics. It turns out they also have important attributes in
common about the way they behave (and can be ruined) in the wash. To
cluster your laundry, most of your decisions are relatively straightforward.
There are of course difficult decisions to be made about which cluster your
white shirt with red stripes goes into (since it is mostly white but has some
color and is permanent press). When clustering is used in business, the
clusters are often much more dynamic, even changing weekly to monthly,
and many more of the decisions concerning which cluster a record falls into
can be difficult.
Clustering Applications
Clustering algorithms can be applied in many fields. For instance:
Marketing: finding groups of customers with similar behavior, given a large
database of customer data containing their properties and past buying
records;
Biology: classification of plants and animals given their features;
Libraries: book ordering;
Insurance: identifying groups of motor insurance policy holders with a
high average claim cost; identifying frauds;
City-planning: identifying groups of houses according to their house
type, value and geographical location;
Earthquake studies: clustering observed earthquake epicenters to
identify dangerous zones;
WWW: document classification; clustering web log data to discover
groups of similar access patterns.

12.3 Cluster Analysis


Cluster analysis is an exploratory data analysis tool for solving classification
problems. Its objective is to sort cases (people, things, events, etc.) into
groups, or clusters, so that the degree of association is strong between
members of the same cluster and weak between members of different
clusters.

In other words, cluster analysis finds groups of objects such that the objects
in a group are similar (or related) to one another and different from (or
unrelated to) the objects in other groups.
Clustering basically provides classification. Such a classification may help:
Formulate hypotheses concerning the origin of the sample, e.g. in
evolution studies.
Describe a sample in terms of a typology, e.g. for market analysis or
administrative purposes.
Predict the future behavior of population types, e.g. in modelling
economic prospects for different industry sectors.
Optimize functional processes, e.g. business site locations or product
design.
Assist in identification, e.g. in diagnosing diseases.
Measure the different effects of treatments on classes within the
population, e.g. with analysis of variance.
Some examples of cluster analysis applications which illustrate the wide
diversity of the subject are listed below.
National voting patterns at the UN
Geological data transmitted from earth satellites
Chemical compounds by their reactive properties
Structure of cell nucleus by genomic components
Infant mortality from family, ethnic and economic data
Publications in a branch of modern physics
Database relationships in management information systems.

Self Assessment Questions


1. The process of grouping a set of physical or abstract objects into
classes of similar objects is called __________________ .
2. Clustering may also be considered as _____________________.
3. Clustering is also called:
a. Segmentation
b. Compression
c. Partitions with similar objects
d. All the above

12.4 Clustering Methods


Four clustering methods are discussed in this unit:
K-means
Hierarchical
Agglomerative
Divisive
Each of these methods is described in the following sections:
12.4.1 K-means
K-means (MacQueen, 1967) is one of the simplest unsupervised learning
algorithms that solve the well-known clustering problem. The procedure
follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori. The main idea is to
define k centroids, one for each cluster.
The basic steps of k-means clustering are simple. In the beginning we
determine the number of clusters K and assume a centroid, or center, for
each of these clusters. We can take any random objects as the initial
centroids, or the first K objects in sequence can also serve as the initial
centroids.
The K-means algorithm then repeats the three steps given below until the
clustering is stable, that is, until no object moves to another group:
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.
These steps are shown as a flow chart in figure 12.2; a small code sketch
follows the chart.

Fig. 12.2: Flow chart representation of K-means
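The following minimal NumPy sketch mirrors this loop. It is illustrative only:
the function name, the use of the first K objects as initial centroids and the
iteration cap are assumptions made here, not part of any particular library.

import numpy as np

def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch: points is an (n, d) array, k the number of clusters."""
    # Take the first k objects as initial centroids (random objects would also work).
    centroids = points[:k].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: compute the distance of each object to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: group each object with its nearest centroid.
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # no object moved: the clustering is stable
        labels = new_labels
        # Step 1: recompute each centroid as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids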


Example 12.1: Suppose we have 4 objects as our training data points, and
each object has 2 attributes. Each attribute represents a coordinate of the
object.
Table 12.1: Training Data

Object        Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A    1                               1
Medicine B    2                               1
Medicine C    4                               3
Medicine D    5                               4

We also know beforehand that these objects belong to two groups of
medicine (cluster 1 and cluster 2).
Each medicine represents one point with two attributes (X, Y), which we can
represent as a coordinate in an attribute space, as shown in figure 12.3.

Fig. 12.3: Weight Index Matrix

1) Initial Value of Centroids: Suppose we use medicine A and medicine B
as the first centroids. Let C1 and C2 denote the coordinates of the centroids;
then C1 = (1, 1) and C2 = (2, 1).

Fig. 12.4: Weight Index Matrix

2) Objects-Centroids Distance: We calculate the distance from each
cluster centroid to each object. Using the Euclidean distance, the distance
matrix at iteration 0 is

        A      B      C      D
C1      0      1      3.61   5.00
C2      1      0      2.83   4.24

Each column in the distance matrix corresponds to an object. The first row
gives the distance of each object to the first centroid, and the second row
gives the distance of each object to the second centroid. For example, the
distance from medicine C = (4, 3) to the first centroid C1 = (1, 1) is
sqrt((4 - 1)^2 + (3 - 1)^2) = 3.61, and its distance to the second centroid
C2 = (2, 1) is sqrt((4 - 2)^2 + (3 - 1)^2) = 2.83, and so on.
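For readers who want to verify the numbers, the iteration-0 distance matrix
above can be reproduced with a few lines of NumPy (a sketch; the array
names are illustrative):

import numpy as np

objects = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)    # A, B, C, D
centroids = np.array([[1, 1], [2, 1]], dtype=float)                  # C1, C2

# Rows correspond to centroids, columns to objects, as in the matrix above.
d0 = np.linalg.norm(objects[None, :, :] - centroids[:, None, :], axis=2)
print(d0.round(2))
# [[0.   1.   3.61 5.  ]
#  [1.   0.   2.83 4.24]]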
3) Object Clustering: We assign each object based on the minimum
distance. Thus, medicine A is assigned to group 1, medicine B to group 2,
medicine C to group 2 and medicine D to group 2. The element of the group
matrix below is 1 if and only if the object is assigned to that group:

          A   B   C   D
Group 1   1   0   0   0
Group 2   0   1   1   1

4) Iteration-1, Determine Centroids: Knowing the members of each group,
we now compute the new centroid of each group based on these new
memberships. Group 1 has only one member, thus the centroid remains at
C1 = (1, 1). Group 2 now has three members, thus the centroid is the
average coordinate of the three members:
C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3), approximately (3.67, 2.67).

Fig. 12.5: Weight Index Matrix - Iteration 1

5) Iteration-1, Objects-Centroids Distances: The next step is to compute
the distance of all objects to the new centroids. Similar to step 2, the
distance matrix at iteration 1 is

        A      B      C      D
C1      0      1      3.61   5.00
C2      3.14   2.36   0.47   1.89

6) Iteration-1, Object Clustering: Similar to step 3, we assign each object
based on the minimum distance. Based on the new distance matrix, we
move medicine B to group 1 while all the other objects remain where they
are. The group matrix is now

          A   B   C   D
Group 1   1   1   0   0
Group 2   0   0   1   1

7) Iteration-2, Determine Centroids: Now we repeat step 4 to calculate
the new centroid coordinates based on the clustering of the previous
iteration. Group 1 and group 2 both have two members, thus the new
centroids are C1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1) and
C2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5).
8) Iteration-2, Objects-Centroids Distances: Repeating step 2 again, we
have the new distance matrix at iteration 2:

        A      B      C      D
C1      0.50   0.50   3.20   4.61
C2      4.30   3.54   0.71   0.71

Fig. 12.6: Weight Index Matrix - Iteration 2

9) Iteration-2, Object Clustering: Again, we assign each object based on
the minimum distance.

We obtain the same group matrix as before, G2 = G1. Comparing the
grouping of the last iteration and this iteration reveals that no object moves
to another group. Thus, the computation of k-means clustering has reached
stability and no more iterations are needed. We get the final grouping as the
result (Table 12.2).
Table 12.2: Grouping Results

Object        Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A    1                             1                   1
Medicine B    2                             1                   1
Medicine C    4                             3                   2
Medicine D    5                             4                   2
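As a quick cross-check (not part of the original example), the same grouping
can be reproduced with scikit-learn's KMeans by seeding it with medicines A
and B as the initial centroids; scikit-learn is assumed to be installed, and the
printed values are what this seeding is expected to produce:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)    # medicines A, B, C, D
init = np.array([[1, 1], [2, 1]], dtype=float)                 # start from A and B, as above

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 0 1 1]: A and B in one group, C and D in the other
print(km.cluster_centers_)  # [[1.5 1. ] [4.5 3.5]]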

Advantages:
With a large number of variables, K-Means may be computationally
faster than hierarchical clustering (if K is small).
K-Means may produce tighter clusters than hierarchical clustering,
especially if the clusters are globular.
The K-means method as described has the following drawbacks:
It does not do well with overlapping clusters.
The clusters are easily pulled off-center by outliers.
Each record is either inside or outside of a given cluster.
The basic K-means algorithm has many variations. Many commercial
software tools that include automatic cluster detection incorporate some of
these variations. There are several other approaches to clustering,
including agglomerative clustering, divisive clustering, and self-organizing
maps (the last of which is not discussed in this unit).

12.4.2 Hierarchical Clustering


In hierarchical clustering the data are not partitioned into a particular cluster
in a single step. Instead, a series of partitions takes place, which may run
from a single cluster containing all objects to n clusters each containing a
single object. Hierarchical Clustering is subdivided into agglomerative
methods, which proceed by series of fusions of the n objects into groups,
and divisive methods, which separate n objects successively into finer
groupings. Agglomerative techniques are more commonly used, and this is
the method implemented in XLMiner. Hierarchical clustering may be
represented by a two-dimensional diagram known as a dendrogram, which
illustrates the fusions or divisions made at each successive stage of
analysis. An example of such a dendrogram is given below:

Fig. 12.7: Agglomerative and Divisive Clustering

Algorithm
Given a set of N items to be clustered, and an NxN distance (or similarity)
matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
1) Start by assigning each item to its own cluster, so that if you have N
items, you now have N clusters, each containing just one item. Let the
distances (similarities) between the clusters equal the distances
(similarities) between the items they contain.
2) Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster.
3) Compute distances (similarities) between the new cluster and each of
the old clusters.
4) Repeat steps 2 and 3 until all items are clustered into a single cluster of
size N.
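Libraries implement this bottom-up procedure directly. As an illustrative
sketch (assuming SciPy and Matplotlib are available, and using random
two-dimensional items in place of a real data set), the dendrogram
described above can be built and drawn as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Ten illustrative two-dimensional items standing in for a real data set.
items = np.random.RandomState(0).rand(10, 2)

# 'single' linkage merges the closest pair of clusters at every step.
Z = linkage(items, method='single', metric='euclidean')

# Each row of Z records one fusion; the dendrogram draws them bottom-up.
dendrogram(Z)
plt.show()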

Step 3 can be done in different ways, which is what distinguishes single-link
from complete-link and average-link clustering. In single-link clustering (also
called the connectedness or minimum method), we consider the distance
between one cluster and another cluster to be equal to the shortest distance
from any member of one cluster to any member of the other cluster. If the
data consist of similarities, we consider the similarity between one cluster
and another cluster to be equal to the greatest similarity from any member
of one cluster to any member of the other cluster. In complete-link clustering
(also called the diameter or maximum method), we consider the distance
between one cluster and another cluster to be equal to the longest distance
from any member of one cluster to any member of the other cluster. In
average-link clustering, we consider the distance between one cluster and
another cluster to be equal to the average distance from any member of one
cluster to any member of the other cluster.
12.4.3 Agglomerative Clustering
The K-means approach to clustering starts out with a fixed number of
clusters and allocates all records into exactly that number of clusters.
Another class of methods works by agglomeration. These methods start out
with each data point forming its own cluster and gradually merge them into
larger and larger clusters until all points have been gathered together into
one big cluster. Toward the beginning of the process, the clusters are very
small and very pure. The members of each cluster are few and closely
related. Towards the end of the process, the clusters are large and not as
well defined. The entire history is preserved making it possible to choose the
level of clustering that works best for a given application.
Algorithm
The first step is to create a similarity matrix. The similarity matrix is a table of
all the pair-wise distances or degrees of similarity between clusters. Initially,
the similarity matrix contains the pair-wise distance between individual pairs
of records. As discussed earlier, there are many measures of similarity
between records, including the Euclidean distance, the angle between
vectors, and the ratio of matching to nonmatching categorical fields. The
issues raised by the choice of distance measures are exactly the same as
those previously discussed in relation to the K-means approach. It might
seem that with N initial clusters for N data points, N² measurement
calculations are required to create the distance table. If the similarity
measure is a true distance metric, only half that number is needed, because
all true distance metrics follow the rule that Distance(X, Y) = Distance(Y, X).
In the vocabulary of mathematics, the similarity matrix is lower triangular. The
next step is to find the smallest value in the similarity matrix. This identifies
the two clusters that are most similar to one another. Merge these two
clusters into a new one and update the similarity matrix by replacing the two
rows that described the parent clusters with a new row that describes the
distance between the merged cluster and the remaining clusters. There are
now N - 1 clusters and N - 1 rows in the similarity matrix. Repeat the merge
step N - 1 times, so that all records belong to the same large cluster. Each
iteration remembers which clusters were merged and the distance between
them. This information is later used to decide which level of clustering to
make use of.
Subsequent trips through the loop need to update the similarity matrix with
the distances from the new, multirecord cluster to all the others. How do we
measure this distance?
As usual, there is a choice of approaches. Three common ones are:
Single Linkage
Complete Linkage
Centroid Distance
In the single linkage method, the distance between two clusters is given by
the distance between the closest members. This method produces clusters
with the property that every member of a cluster is more closely related to at
least one member of its cluster than to any point outside it.
Another approach is the complete linkage method, where the distance
between two clusters is given by the distance between their most distant
members. This method produces clusters with the property that all members
lie within some known maximum distance of one another.
The third method is the centroid distance, where the distance between two
clusters is measured between the centroids of each. The centroid of a
cluster is its average element. Figure 12.8 gives a pictorial representation of
these three methods.

Fig. 12.8: Single linkage, complete linkage, Centroid distance
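In code, the three measures differ only in how the pairwise distances
between cluster members are combined. A minimal sketch, with illustrative
function names and NumPy arrays standing in for clusters:

import numpy as np

def single_linkage(a, b):
    """Distance between the closest members of clusters a and b."""
    return min(np.linalg.norm(x - y) for x in a for y in b)

def complete_linkage(a, b):
    """Distance between the most distant members of clusters a and b."""
    return max(np.linalg.norm(x - y) for x in a for y in b)

def centroid_distance(a, b):
    """Distance between the centroids (average elements) of clusters a and b."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

# Example: two small clusters of two-dimensional points.
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[3.0, 0.0], [5.0, 0.0]])
print(single_linkage(a, b), complete_linkage(a, b), centroid_distance(a, b))  # 2.0 5.0 3.5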

The agglomeration algorithm creates hierarchical clusters. At each level in
the hierarchy, clusters are formed from the union of two clusters at the next
level down. A good way of visualizing these clusters is as a tree.
Clustering people by age: an example of agglomerative clustering
This illustration of agglomerative clustering uses an example in one
dimension with the single linkage measure for distance between clusters.
These choices make it possible to follow the algorithm through all its
iterations without having to worry about calculating distances using squares
and square roots. The data consists of the ages of people at a family
gathering. The goal is to cluster the participants using their age, and the
metric for the distance between two people is simply the difference in their
ages. The metric for the distance between two clusters of people is the
difference in age between the oldest member of the younger cluster and the
youngest member of the older cluster.
Because the distances are so easy to calculate, the example dispenses with
the similarity matrix. The procedure is to sort the participants by age, then
begin clustering by first merging clusters that are 1 year apart, then 2 years,
and so on until there is only one big cluster.
Figure 12.9 shows the clusters after six iterations, with three clusters
remaining. This is the level of clustering that seems the most useful. The
algorithm appears to have clustered the population into three generations:
children, parents, and grandparents.

Fig. 12.9: Dendrogram Representation for Hierarchical Clustering
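Since the unit does not list the actual ages, the sketch below uses a
hypothetical set of ages and SciPy's single-linkage clustering to show how
such a three-generation split can be obtained; the 10-year cut-off is likewise
an assumption:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical ages at the family gathering (the unit does not list the real ones).
ages = np.array([1, 3, 5, 8, 9, 11, 12, 37, 43, 45, 49, 64, 68, 73], dtype=float)

# Single linkage: cluster-to-cluster distance is the gap between closest members.
Z = linkage(ages.reshape(-1, 1), method='single')

# Cut the dendrogram where the gap between clusters exceeds 10 years.
labels = fcluster(Z, t=10, criterion='distance')
for g in np.unique(labels):
    print('cluster', g, '->', ages[labels == g])
# Expected: three clusters roughly matching children, parents and grandparents.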

12.4.4 Divisive Clustering


So far we have only looked at agglomerative clustering, but a cluster
hierarchy can also be generated top-down. This variant of hierarchical
clustering is called top-down clustering or divisive clustering. We start at the
top with all documents in one cluster. The cluster is split using a flat
clustering algorithm. This procedure is applied recursively until each
document is in its own singleton cluster.
Top-down clustering is conceptually more complex than bottom-up
clustering since we need a second, flat clustering algorithm as a
subroutine. It has the advantage of being more efficient if we do not
generate a complete hierarchy all the way down to individual document
leaves. For a fixed number of top levels, using an efficient flat algorithm like
K-means, top-down algorithms are linear in the number of documents and
clusters. The procedure for divisive clustering is as follows:
Algorithm:
Divisive Clustering starts by placing all objects into a single group. Before
we start the procedure, we need to decide on a threshold distance. Once
this is done, the procedure is as follows:
1) The distance between all pairs of objects within the same group is
determined and the pair with the largest distance is selected.
2) This maximum distance is compared to the threshold distance.
a. If it is larger than the threshold, this group is divided into two. This is
done by placing the selected pair into different groups and using
them as seed points. All other objects in this group are examined,
and are placed into the new group with the closest seed point. The
procedure then returns to Step 1.
b. If the distance between the selected objects is less than the
threshold, the divisive clustering stops.
To run divisive clustering, you simply need to decide upon a method of
measuring the distance between two objects.
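A minimal sketch of this threshold-based procedure, assuming Euclidean
distance and using illustrative function and variable names:

import numpy as np
from itertools import combinations

def divisive_clustering(points, threshold):
    """Split groups until no within-group pair is farther apart than the threshold."""
    groups = [list(range(len(points)))]          # start with every object in one group
    while True:
        # Step 1: find the within-group pair with the largest distance.
        candidates = [(g, i, j, np.linalg.norm(points[i] - points[j]))
                      for g in groups for i, j in combinations(g, 2)]
        if not candidates:
            break                                 # only singleton groups remain
        group, seed_a, seed_b, largest = max(candidates, key=lambda c: c[3])
        # Step 2: compare this maximum distance to the threshold.
        if largest <= threshold:
            break                                 # 2b: stop splitting
        # 2a: split the group, using the selected pair as seed points.
        new_a, new_b = [seed_a], [seed_b]
        for idx in group:
            if idx in (seed_a, seed_b):
                continue
            d_a = np.linalg.norm(points[idx] - points[seed_a])
            d_b = np.linalg.norm(points[idx] - points[seed_b])
            (new_a if d_a <= d_b else new_b).append(idx)
        groups.remove(group)
        groups.extend([new_a, new_b])
    return [points[g] for g in groups]

# Example: two well-separated pairs of points split with a threshold of 2.
pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
for cluster in divisive_clustering(pts, threshold=2.0):
    print(cluster)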
Self Assessment Questions
4. Clustering is used only in data mining (True/False).
5. Clustering is a form of learning by observation rather than _____________.
6. Weight and height of an individual fall into ________________ kind of
variables.
7. In the K-means algorithm for partitioning, each cluster is represented by
the _____________ of objects in the cluster.
8. K-means clustering requires prior knowledge about the number of
clusters required as its input. (True/False)
9. One form of unsupervised learning is ___________.

12.5 Clustering and Segmentation Software


Commercial software is proprietary. To obtain it, a software license is
required, and a separate fee must be paid for each license.

A list of the commercial software is given below:


BayesiaLab includes Bayesian classification algorithms for data
segmentation and uses Bayesian networks to automatically cluster the
variables.
ClustanGraphics3, hierarchical cluster analysis from the top, with
powerful graphics.
CViz Cluster Visualization, for analyzing large high-dimensional datasets
provides full-motion cluster visualization.
IBM Intelligent Miner for Data includes 2 clustering algorithms.
Neusciences aXi.Kohonen, ActiveX Control for Kohonen Clustering,
includes a Delphi interface.
PolyAnalyst offers clustering based on Localization of Anomalies (LA)
algorithm.
StarProbe, cross-platform, very fast on big data, star schema support,
special tools and features for data with rich categorical dimensional
information.
Visipoint, Self-Organizing Map clustering and visualization.

Free (Open source)


This category of software is freely available for public use. Most of it can be
downloaded from the Internet, and since the source is open, the code can
be modified if required. The best-known open source operating system is
Linux. Some free open source clustering software is listed below:
Autoclass C, an unsupervised Bayesian classification system from
NASA, available for Unix and Windows.
CLUTO, which provides a set of partitioned clustering algorithms that treat the
clustering problem as an optimization process.
Databionic ESOM Tools, a suite of programs for clustering,
visualization, and classification with Emergent Self-Organizing Maps
(ESOM).
David Dowe Mixture Modeling page for modeling statistical distribution
by a mixture (or weighted sum) of other distributions.
MCLUST/EMCLUST, model-based cluster and discriminant analysis,
including hierarchical clustering. In Fortran with interface to S-PLUS.

PermutMatrix, graphical software for clustering and seriation analysis,
with several types of hierarchical cluster analysis and several methods
to find an optimal reorganization of rows and columns.
PROXIMUS, a tool for compressing, clustering and discovering patterns
in discrete datasets.
Snob, MML (Minimum Message Length)-based program for clustering.
SOM in Excel from Angshuman Saha.
StarProbe, web-based multi-user server available for academic
institutions.

12.6 Evaluating Clusters


When using the K-means approach to cluster detection, is there a way to
determine what value of K finds the best clusters? Similarly, when using a
hierarchical approach, is there a test for which level in the hierarchy
contains the best clusters? What does it mean to say that a cluster is good?
These questions are important when using clustering in practice. In general
terms, clusters should have members that have a high degree of similarity
or, in geometric terms, are close to each other and the clusters
themselves should be widely spaced. A standard measure of within-cluster
similarity is variance (the sum of the squared differences of each element
from the mean), so the best set of clusters might be the set whose clusters
have the lowest variance. However, this measure does not take into account
cluster size.
A similar measure would be the average variance, the total variance divided
by the size of the cluster. For agglomerative clustering, using variance as a
measure does not make sense, since this method always starts out with
single-member clusters which, of course, have variance zero. A good measure to
use with agglomerative clusters is the difference between the distance value
at which it was formed and the distance value at which it is merged into the
next level. This is a measure of the durability of the cluster.
Self Assessment Questions
10. ______ software provides a set of partitioned clustering algorithms that
treat the clustering problem as an optimization process.
11. A standard measure of within-cluster similarity is _______.

12.7 Summary
Data Mining is the process of running data through sophisticated
algorithms to uncover meaningful patterns and correlations that may
otherwise be hidden.
Clustering is the classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters), so
that the data in each subset (ideally) share some common
characteristics.
Clustering can be applied in areas like marketing, finance, insurance, etc.
Several clustering and segmentation software packages are available,
including:
o BayesiaLab, which includes Bayesian classification algorithms
o Cluster 3.0 (open source)
o IBM Intelligent Miner for Data, which includes 2 clustering algorithms
o Neusciences aXi.Kohonen, an ActiveX control for Kohonen clustering,
which includes a Delphi interface
o SAS Enterprise Miner (licensed), etc.
The K-means approach to clustering starts out with a fixed number of
clusters and allocates all records into exactly that number of clusters.
Agglomerative clustering starts out with each data point forming its own
cluster and gradually merges clusters into larger and larger ones until all
points have been gathered together into one big cluster.

12.8 Terminal Questions


1. What is Clustering? Explain K-means clustering method with an
example.
2. Explain hierarchical clustering with an example.
3. What is divisive clustering? Write algorithmic steps for the divisive
clustering.
4. What is unsupervised learning? When is it required for analysis?

12.9 Answers
Self Assessment Questions
1. Clustering
2. Segmentation
3. d. All the above
4. True
5. By example
6. Continuous
7. Means
8. True
9. Clustering
10. CLUTO
11. variance
Terminal Questions
1. Clustering is the classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters). The
basic steps of k-means clustering are simple. In the beginning we
determine the number of clusters K and assume a centroid, or center, for
each of these clusters. Any random objects can be taken as the initial
centroids, or the first K objects in sequence can serve as the initial
centroids. (Refer section 12.4.1).
2. In hierarchical clustering the data are not partitioned into a particular
cluster in a single step. Instead, a series of partitions takes place, which
may run from a single cluster containing all objects to n clusters each
containing a single object. (Refer section 12.4.2).
3. This variant of hierarchical clustering is called top-down clustering or
divisive clustering. Divisive Clustering starts by placing all objects into a
single group. (Refer section 12.4.4).
4. Unsupervised learning is a class of problems in which one seeks to
determine how the data are organized. Unsupervised learning is closely
related to the problem of density estimation in statistics.
Clustering is one form of unsupervised learning. Unsupervised learning
is required whenever you want to discover groupings in data for which
no class labels are available. (Refer section 12.4).
