Unit 12 Clustering
Structure:
12.1 Introduction
Objectives
12.2 Clustering
12.3 Cluster Analysis
12.4 Clustering Methods
K-means
Hierarchical clustering
Agglomerative clustering
Divisive clustering
12.5 Clustering and Segmentation Software
12.6 Evaluating Clusters
12.7 Summary
12.8 Terminal Questions
12.9 Answers
12.1 Introduction
In this unit you will be familiarized with clustering. Clustering and the
nearest neighbor prediction technique are among the oldest techniques
used in data mining. Clustering is the process of grouping similar records.
Nearest neighbor is a prediction technique quite similar to clustering: to
predict the value of one record, look for records with similar predictor
values in the historical database, then use the prediction value from the
record that is nearest to the unclassified record.
Objectives:
After studying this unit, you should be able to:
discuss cluster analysis and clustering methods
describe the K-means algorithm
explain the agglomerative algorithm
explain the divisive algorithm
describe clustering and segmentation software
12.2 Clustering
Clustering is the method by which similar records are grouped together.
Usually this is done to give the end user a high-level view of what is going
on in the database. Clustering is sometimes used to mean segmentation,
which most marketing people will tell you is useful for coming up with a
bird's-eye view of the business. Clustering is a form of learning by
observation rather than learning by example.
Clustering is the classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters), so that
the data in each subset (ideally) share some common characteristics.
These clusters can be used to help you understand the business better and
can also be exploited to improve future performance through predictive
analytics. For example, data mining can warn you that there's a high
probability a specific customer won't pay on time, based on an analysis of
customers with similar characteristics.
Cluster Representation
In this case we can easily identify the four clusters into which the data can
be divided. The similarity criterion is distance: two or more objects belong to
the same cluster if they are close according to a given distance (in this case
geometrical distance). This is called distance-based clustering. Another kind
of clustering is conceptual clustering: two or more objects belong to the
same cluster if together they define a concept common to all of them. In
other words, objects are grouped according to their fit to descriptive
concepts, not according to simple similarity measures.
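To make the distance criterion concrete, here is a minimal Python sketch of
a distance-based similarity check (the sample coordinates are invented for
illustration):

import math

def euclidean(p, q):
    # straight-line distance between two points in attribute space
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# two objects tend to fall in the same cluster when this distance is small
print(euclidean((1, 1), (2, 1)))  # 1.0 -> likely the same cluster
print(euclidean((1, 1), (5, 4)))  # 5.0 -> likely different clusters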
Clustering means finding groups of objects such that the objects in a group
are similar (or related) to one another and different from (or unrelated to)
the objects in other groups.
Clustering basically provides classification. Such a classification may help:
Formulate hypotheses concerning the origin of the sample, e.g. in
evolution studies.
Describe a sample in terms of a typology, e.g. for market analysis or
administrative purposes.
Predict the future behavior of population types, e.g. in modelling
economic prospects for different industry sectors.
Optimize functional processes, e.g. business site locations or product
design.
Assist in identification, e.g. in diagnosing diseases.
Measure the different effects of treatments on classes within the
population, e.g. with analysis of variance.
Some examples of cluster analysis applications which illustrate the wide
diversity of the subject are listed below.
National voting patterns at the UN
Geological data transmitted from earth satellites
Chemical compounds by their reactive properties
Structure of cell nucleus by genomic components
Infant mortality from family, ethnic and economic data
Publications in a branch of modern physics
Database relationships in management information systems.
Example 12.1: Suppose we have 4 objects as training data points, and
each object has 2 attributes. Each attribute represents a coordinate of the
object.
Table 12.1: Training Data
Object        Attribute X    Attribute Y
Medicine A         1              1
Medicine B         2              1
Medicine C         4              3
Medicine D         5              4
We also know beforehand that these objects belong to two groups of
medicine (cluster 1 and cluster 2). Each medicine represents one point with
two attributes (X, Y), which we can represent as a coordinate in an attribute
space, as shown in figure 12.3.
Each column in the distance matrix corresponds to one object. The first row
of the distance matrix holds the distance of each object to the first centroid,
and the second row holds the distance of each object to the second
centroid. For example, the distance from medicine C = (4, 3) to the first
centroid c1 = (1, 1) is sqrt((4 - 1)^2 + (3 - 1)^2) = 3.61, and its distance to
the second centroid c2 = (2, 1) is sqrt((4 - 2)^2 + (3 - 1)^2) = 2.83, and
so on.
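As a concrete check, here is a minimal Python sketch that reproduces this
iteration-0 distance matrix, assuming the coordinates from Table 12.1 and
the initial centroids c1 = A = (1, 1) and c2 = B = (2, 1) (illustrative only):

import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]  # c1, c2

def dist(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# each row holds the distances from every object to one centroid
for c in centroids:
    print([round(dist(p, c), 2) for p in points.values()])
# row 1: [0.0, 1.0, 3.61, 5.0]   distances to c1
# row 2: [1.0, 0.0, 2.83, 4.24]  distances to c2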
3) Object Clustering: We assign each object to the group with the minimum
distance. Thus, medicine A is assigned to group 1, and medicines B, C
and D are assigned to group 2. The element of the group matrix below is
1 if and only if the object is assigned to that group.
(In steps 4 to 7, the centroids are recomputed as the means of the current
groups and the objects are regrouped; these steps repeat the pattern
above.)
8) Iteration 2, Objects-Centroids Distances: Repeat step 2. We obtain a
new distance matrix at iteration 2. The result is G2 = G1: comparing the
grouping of the last iteration and this iteration reveals that the objects no
longer move between groups. Thus, the K-means computation has reached
stability and no more iterations are needed. We get the final grouping as
the result (Table 12.2).
Table 12.2: Grouping Results
Object        Group
Medicine A      1
Medicine B      1
Medicine C      2
Medicine D      2
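The whole procedure can be condensed into a short Python sketch of the
K-means steps just described, using the data from Table 12.1 (variable
names are my own; this is an illustration, not production code, and it
assumes no cluster ever becomes empty):

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # medicines A, B, C, D
centroids = [(1, 1), (2, 1)]               # initial centroids c1, c2

def dist2(p, q):
    # squared Euclidean distance (same ordering as the true distance)
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

groups = None
while True:
    # assignment step: each object goes to its nearest centroid
    new_groups = [min(range(2), key=lambda k: dist2(p, centroids[k]))
                  for p in points]
    if new_groups == groups:   # G(n) == G(n-1): stable, stop iterating
        break
    groups = new_groups
    # update step: move each centroid to the mean of its members
    for k in range(2):
        members = [p for p, g in zip(points, groups) if g == k]
        centroids[k] = (sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))

print(groups)  # [0, 0, 1, 1]: A and B in one group, C and D in the other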
Advantages:
With a large number of variables, K-Means may be computationally
faster than hierarchical clustering (if K is small).
K-Means may produce tighter clusters than hierarchical clustering,
especially if the clusters are globular.
The K-means method as described has the following drawbacks:
It does not do well with overlapping clusters.
The clusters are easily pulled off-center by outliers.
Each record is either inside or outside of a given cluster.
The basic K-means algorithm has many variations, and many commercial
software tools that include automatic cluster detection incorporate some of
them. There are several other approaches to clustering, including
agglomerative clustering, divisive clustering, and self-organizing maps
(self-organizing maps are not discussed in this unit).
Algorithm
Given a set of N items to be clustered and an N x N distance (or similarity)
matrix, the basic process of Johnson's (1967) hierarchical clustering is as
follows:
1) Start by assigning each item to its own cluster, so that if you have N
items, you now have N clusters, each containing just one item. Let the
distances (similarities) between the clusters equal the distances
(similarities) between the items they contain.
2) Find the closest (most similar) pair of clusters and merge them into a
single cluster, so that now you have one less cluster.
3) Compute distances (similarities) between the new cluster and each of
the old clusters.
4) Repeat steps 2 and 3 until all items are clustered into a single cluster of
size N.
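A minimal Python sketch of these four steps, using single linkage (defined
later in this section) to measure cluster distances; the data and names are
illustrative assumptions, not part of the original text:

import math

def agglomerate(items):
    # step 1: every item starts in its own cluster
    clusters = [[p] for p in items]
    merges = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters; single linkage uses
        # the distance between the closest members of the two clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # steps 3 and 4: merge the pair, record the merge, and repeat
        merges.append((clusters[i], clusters[j], round(d, 2)))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges  # the merge history shows which level of clustering to use

for step in agglomerate([(1, 1), (2, 1), (4, 3), (5, 4)]):
    print(step)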
If the similarity measure is a true distance metric, only half of the matrix is
needed, because all true distance metrics follow the rule that
Distance(X, Y) = Distance(Y, X). In the vocabulary of mathematics, the
similarity matrix is lower triangular. The next step is to find the smallest
value in the similarity matrix. This identifies the two clusters that are most
similar to one another. Merge these two clusters into a new one and update
the similarity matrix by replacing the two rows that described the parent
clusters with a new row that describes the distance between the merged
cluster and the remaining clusters. There are now N - 1 clusters and N - 1
rows in the similarity matrix. Repeat the merge step N - 1 times, until all
records belong to one large cluster. Each iteration records which clusters
were merged and the distance between them. This information is later used
to decide which level of clustering to make use of.
Subsequent trips through the loop need to update the similarity matrix with
the distances from the new, multirecord cluster to all the others. How do we
measure this distance?
As usual, there is a choice of approaches. Three common ones are:
Single Linkage
Complete Linkage
Centroid Distance
In the single linkage method, the distance between two clusters is given by
the distance between the closest members. This method produces clusters
with the property that every member of a cluster is more closely related to at
least one member of its cluster than to any point outside it.
Another approach is the complete linkage method, where the distance
between two clusters is given by the distance between their most distant
members. This method produces clusters with the property that all members
lie within some known maximum distance of one another.
The third method is centroid distance, where the distance between two
clusters is measured between their centroids. The centroid of a cluster is its
average element. Figure 12.8 gives a pictorial representation of these three
methods.
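These three measures can be written directly as small Python functions (a
sketch; representing clusters as lists of coordinate tuples is my own
assumption for illustration):

import math

def single_linkage(c1, c2):
    # distance between the closest members of the two clusters
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    # distance between the most distant members
    return max(math.dist(p, q) for p in c1 for q in c2)

def centroid_distance(c1, c2):
    # distance between the average elements (centroids) of the clusters
    def centroid(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))
    return math.dist(centroid(c1), centroid(c2))

a, b = [(1, 1), (2, 1)], [(4, 3), (5, 4)]
print(single_linkage(a, b))     # 2.83
print(complete_linkage(a, b))   # 5.0
print(centroid_distance(a, b))  # 3.91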
For example, when clustering a population by age, we might begin by first
merging clusters that are 1 year apart, then 2 years, and so on, until there
is only one big cluster. Figure 12.9 shows the clusters after six iterations,
with three clusters remaining. This is the level of clustering that seems the
most useful: the algorithm appears to have clustered the population into
three generations: children, parents, and grandparents.
Divisive (top-down) clustering starts with all documents in one cluster and
recursively splits clusters until single documents remain at the leaves. For a
fixed number of top levels, using an efficient flat algorithm like K-means,
top-down algorithms are linear in the number of documents and clusters.
The procedure for divisive clustering is given as follows.
Algorithm:
Divisive Clustering starts by placing all objects into a single group. Before
we start the procedure, we need to decide on a threshold distance. Once
this is done, the procedure is as follows:
1) The distance between all pairs of objects within the same group is
determined and the pair with the largest distance is selected.
2) This maximum distance is compared to the threshold distance.
a. If it is larger than the threshold, this group is divided into two. This is
done by placing the selected pair into different groups and using
them as seed points. All other objects in this group are examined,
and are placed into the new group with the closest seed point. The
procedure then returns to Step 1.
b. If the distance between the selected objects is less than the
threshold, the divisive clustering stops.
To run divisive clustering, you simply need to decide upon a method of
measuring the distance between two objects.
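A minimal Python sketch of this divisive procedure, using Euclidean
distance and a caller-chosen threshold (the data and names are illustrative
assumptions, not from the text):

import math

def divisive(objects, threshold):
    groups = [list(objects)]  # start with all objects in a single group
    while True:
        # step 1: over every group, find the pair with the largest distance
        candidates = [
            (math.dist(p, q), g, p, q)
            for g in groups if len(g) > 1
            for i, p in enumerate(g) for q in g[i + 1:]
        ]
        if not candidates:        # only singletons left: nothing to split
            return groups
        d, grp, s1, s2 = max(candidates, key=lambda t: t[0])
        # step 2: stop when the largest distance is within the threshold
        if d <= threshold:
            return groups
        # step 2a: split the group, using the selected pair as seed points
        g1 = [p for p in grp if math.dist(p, s1) <= math.dist(p, s2)]
        g2 = [p for p in grp if math.dist(p, s1) > math.dist(p, s2)]
        groups.remove(grp)
        groups += [g1, g2]

print(divisive([(1, 1), (2, 1), (4, 3), (5, 4)], threshold=2.0))
# [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]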
Self Assessment Questions
4. Clustering is used only in data mining. (True/False)
5. Clustering is a form of learning by observation rather than _____________.
6. Weight and height of an individual fall into the ________________ kind of
variables.
7. In the K-means algorithm for partitioning, each cluster is represented by
the _____________ of the objects in the cluster.
8. K-means clustering requires prior knowledge of the number of clusters
required as its input. (True/False)
9. One form of unsupervised learning is ___________.
12.7 Summary
Data Mining is the process of running data through sophisticated
algorithms to uncover meaningful patterns and correlations that may
otherwise be hidden.
Clustering is the classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters), so
that the data in each subset (ideally) share some common
characteristics.
Clustering can be applied in areas like marketing, finance, insurance,
etc.
Various clustering and segmentation software packages are available,
including:
o BayesiaLab, which includes Bayesian classification algorithms
o Cluster 3.0 (open source)
o IBM Intelligent Miner for Data, which includes 2 clustering algorithms
o Neusciences aXi.Kohonen, an ActiveX control for Kohonen clustering,
which includes a Delphi interface
o SAS Enterprise Miner (licensed), etc.
The K-means approach to clustering starts out with a fixed number of
clusters and allocates all records into exactly that number of clusters.
Agglomerative clustering starts out with each data point forming its own
cluster and gradually merges clusters into larger and larger ones until all
points have been gathered together into one big cluster.
12.9 Answers
Self Assessment Questions
1. Cluster
2. Segmentation
3. D: All of the above
4. True
5. By example
6. Continuous
7. Means
8. True
9. Clustering
10. CLUTO
11. variance
Terminal Questions
1. Clustering is the classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets (clusters). The
basic steps of K-means clustering are simple. In the beginning we
determine the number of clusters K and assume the centroids or centers
of these clusters. We can take any random objects as the initial
centroids, or the first K objects in sequence can also serve as the initial
centroids. (Refer section 12.4.1)
2. In hierarchical clustering the data are not partitioned into a particular
cluster in a single step. Instead, a series of partitions takes place, which
may run from a single cluster containing all objects to n clusters each
containing a single object. (Refer section 12.4.2).
3. This variant of hierarchical clustering is called top-down clustering or
divisive clustering. Divisive Clustering starts by placing all objects into a
single group. (Refer section 12.4.4).
4. Unsupervised learning is a class of problems in which one seeks to
determine how the data are organized. Unsupervised learning is closely
related to the problem of density estimation in statistics.
Clustering is one form of unsupervised learning, and it is required
whenever you want to estimate or discover groupings in the data.
(Refer section 12.4)