
Wollega University

Chapter Five
Data Warehousing and Data Mining
November 13, 2021

Clustering
Description
Principle
Design
Algorithm
Result Analysis
• Clustering Description

• What is a Cluster?

• A cluster is a subset of similar objects.
• A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
• A connected region of a multidimensional space with a comparatively high density of objects.
• What is clustering in Data Mining?
• Clustering is the method of converting a group of abstract objects into classes of similar objects.
• Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
• It helps users understand the structure or natural grouping in a data set, and it is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.
• Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
• Principle of Clustering in Data Mining
• The following points describe the requirements of clustering in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape.
• It should not be bound to distance measures that tend to find only spherical clusters of small size.
• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.
• Designing Methods of Clustering
• Clustering methods can be classified into the following categories.

• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
• Partitioning Method
• Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’ partitions of the data.
• Each partition represents a cluster, and k ≤ n. This means the data is classified into k groups that satisfy the following requirements (checked in the sketch after this list) −
• Each group contains at least one object.
• Each object must belong to exactly one group.
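
A minimal sketch in Python (with assumed toy data, not from the slides) of the two requirements above: a helper that checks whether a candidate partition has only non-empty groups and covers each object exactly once.

```python
def is_valid_partition(objects, groups):
    """objects: list of items; groups: list of lists (the k candidate clusters)."""
    if any(len(g) == 0 for g in groups):          # each group must contain >= 1 object
        return False
    members = [obj for g in groups for obj in g]  # flatten all groups
    return sorted(members) == sorted(objects)     # each object in exactly one group

print(is_valid_partition([2, 4, 6, 9], [[2, 4], [6, 9]]))     # True
print(is_valid_partition([2, 4, 6, 9], [[2, 4], [4, 6, 9]]))  # False: 4 appears twice
```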

• Points to remember −
• For a given number of partitions (say k), the partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
• Hierarchical Methods
• This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
• There are two approaches here −
• Agglomerative Approach
• Divisive Approach
• Agglomerative Approach
• This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group.
• It keeps merging the objects or groups that are close to one another.
• It keeps doing so until all of the groups are merged into one or until the termination condition holds.
• Divisive Approach
• This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster.
• In each iteration, a cluster is split into smaller clusters.
• This goes on until each object is in its own cluster or the termination condition holds.
• This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
• Density-based Method
• This method is based on the notion of density.
• The basic idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
• Grid-based Method
• In this method, the object space is quantized into a finite number of cells that form a grid structure, and the objects are handled through this grid.
• Advantages
• The major advantage of this method is its fast processing time.
• It is dependent only on the number of cells in each dimension of the quantized space.
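
A minimal sketch of the quantization idea (points and cell size are assumed toy values): each point is mapped to the grid cell that contains it, and only the per-cell counts need to be processed afterwards.

```python
from collections import Counter
import math

def grid_cells(points, cell_size):
    """Map each 2-D point to the grid cell containing it and count points per cell."""
    return Counter((math.floor(x / cell_size), math.floor(y / cell_size))
                   for x, y in points)

points = [(1.2, 0.8), (1.4, 0.9), (5.1, 5.2), (5.3, 5.0)]
print(grid_cells(points, cell_size=2.0))
# Counter({(0, 0): 2, (2, 2): 2}) -- two dense cells, suggesting two clusters
```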
• Model-based Methods
• In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model.
• This method locates the clusters by clustering the density function, which reflects the spatial distribution of the data points.
• It also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
• Constraint-based Method
• In this method, the clustering is performed by incorporating user- or application-oriented constraints.
• A constraint refers to the user's expectations or the properties of the desired clustering results. Constraints provide us with an interactive way of communicating with the clustering process.
• Constraints can be specified by the user or by the application requirements.
• Data Mining Hierarchical Clustering Method Steps
• Below are the steps of the hierarchical clustering method, given a set of ‘n’ items to be clustered and an ‘n×n’ distance matrix:
• Step-1: Assign each item to its own cluster, so that if you have ‘n’ items you now have ‘n’ clusters, each containing just one item. Let the similarities between the clusters equal the similarities between the items they contain.
• Step-2: Find the most similar pair of clusters and merge them into a single cluster.
• Step-3: Compute the similarities between the new cluster and each of the old clusters.
• Step-4: Repeat step 2 and step 3 until all items are clustered into a single cluster of size ‘n’.
• Step-3 can be carried out in different ways: single-link, complete-link, or average-link clustering. In single-link clustering, the similarity between two clusters is given by the shortest distance from any data point of one cluster to any data point of the other cluster.
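
A short sketch of these steps using SciPy's agglomerative clustering (the data matrix X is an assumed toy example; SciPy is assumed to be available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [0.9, 1.1], [8.0, 8.0], [8.2, 7.8]])

# method='single' is single-link: the distance between two clusters is the
# shortest point-to-point distance between them (used in step 3 above).
Z = linkage(X, method='single')                  # the full merge history (steps 1-4)
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 flat clusters
print(labels)                                    # e.g. [1 1 1 2 2]
```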
• Different Types of Clustering Algorithms
• There are various types of data mining clustering algorithms, but only a few popular algorithms are widely used.
• Basically, all clustering algorithms use some distance measure.
• Every algorithm follows a different approach to finding the ‘similar characteristics’ among the data points.
• Data Mining Connectivity Models
• This model follows two approaches.
• 1. In the first approach, all the data points start out in separate clusters, which are then aggregated as the distance decreases.
• 2. In the second approach, all the data points start out as a single cluster, which is then partitioned as the distance increases.
• These models are easy to interpret, but they are not the best choice for handling big data sets. Hierarchical clustering is an example of these models.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN is a density-based algorithm.
• – Density = number of points within a specified radius (Eps)
• – A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.
• – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
• – A noise point is any point that is not a core point or a border point.
• Intra-cluster: when two objects within a single cluster are related, the relation is called intra-cluster.
• Inter-cluster: when objects of two different clusters are related, the relation is called inter-cluster.
• Below are the steps of the DBSCAN clustering method:
• 1. The method requires 2 parameters: epsilon (Eps) and minimum points (MinPts). It starts with a random point that has not yet been visited.
• 2. Find all the neighbor data points within distance Eps of the starting point.
• 3. A cluster is formed if the number of neighbors is greater than or equal to MinPts; the starting point is marked as visited.
• 4. If the number of neighbors is less than MinPts, the data point is marked as noise.
• 5. The algorithm repeats the process recursively.
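
A minimal sketch of these steps using scikit-learn's DBSCAN implementation (the points and parameter values are assumed for illustration; scikit-learn is assumed to be installed):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],   # one dense group
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # another dense group
              [4.0, 20.0]])                          # an isolated point

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # one cluster id per point; -1 marks noise, e.g. [0 0 0 1 1 1 -1]
```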
DBSCAN
• Eps − the radius used to design the cluster neighborhood.
• MinPts − the minimum number of points required within that radius.
• The major concepts here are:
• 1. Core point
• 2. Border point
• 3. Noise point
• Core point: a point whose Eps-neighborhood satisfies the MinPts requirement.
• Border point: a neighbor of a core point that is not itself a core point.
• Noise point: an outlier that does not satisfy the MinPts requirement and has no relation to a core point.
[Figure: DBSCAN core, border, and noise points]
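
A hand-rolled sketch (assumed toy data, 1-D distances for brevity) that labels each point as core, border, or noise using the definitions above:

```python
def classify(points, eps, min_pts):
    def neighbors(i):
        # indices within distance eps of point i (the point itself included)
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]
    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = 'core'
        elif any(j in core for j in neighbors(i)):
            labels[i] = 'border'   # not dense enough itself, but near a core point
        else:
            labels[i] = 'noise'    # neither core nor reachable from a core point
    return labels

print(classify([1.0, 1.2, 1.4, 1.8, 9.0], eps=0.5, min_pts=3))
# {0: 'core', 1: 'core', 2: 'core', 3: 'border', 4: 'noise'}
```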
• Classical Partitioning Methods:
• The most well-known and commonly used partitioning methods are
• The k-Means method
• The k-Medoids method
• The k-Medoids algorithm: the k-medoids algorithm partitions the data based on medoids, or central objects.
• The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
• However, its processing is more costly than the k-means method.
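
A simplified alternating k-medoids sketch (an assumption for illustration; the full PAM algorithm also evaluates medoid/non-medoid swaps). Because a medoid must be an actual data point, a single extreme value cannot drag it the way it drags a mean:

```python
def k_medoids(points, medoids):
    """Alternate assignment and medoid update until the medoids stop changing."""
    while True:
        # assignment step: each point joins its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: abs(p - m))].append(p)
        # update step: the new medoid minimizes total distance within its cluster
        new = sorted(min(c, key=lambda cand: sum(abs(cand - q) for q in c))
                     for c in clusters.values())
        if new == sorted(medoids):
            return new, list(clusters.values())
        medoids = new

medoids, clusters = k_medoids([2, 4, 6, 9, 12, 16, 20, 24, 26], medoids=[4, 12])
print(medoids, clusters)   # [6, 20] [[2, 4, 6, 9, 12], [16, 20, 24, 26]]
```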
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
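
The basic algorithm, sketched in Python with NumPy (toy data and initial centroids are assumed; empty clusters are not handled, to keep the sketch short):

```python
import numpy as np

def k_means(X, centroids, n_iter=100):
    for _ in range(n_iter):
        # assignment step: each point goes to its closest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
        if np.allclose(new, centroids):   # converged: centroids stopped moving
            break
        centroids = new
    return centroids, labels

X = np.array([[1.0, 1.0], [1.5, 1.2], [8.0, 8.0], [8.2, 7.8]])
print(k_means(X, centroids=X[:2].copy()))   # converges to one centroid per blob
```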
K-means Clustering – Details
• Initial centroids are often chosen randomly.
• – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
• – Often the stopping condition is changed to ‘until relatively few points change clusters’.
• Complexity is O(n × K × I × d)
• – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
Limitations of K-means
• K-means has problems when clusters are of differing
• – Sizes
• – Densities
• – Non-globular shapes
• K-means has problems when the data contains outliers.
• Data Mining Centroid Models
• The data mining k-means algorithm is the best example that falls under this category.
• In this model, the number of clusters required at the end is known in advance. Therefore, it is important to have prior knowledge of the data set.
• These are iterative data mining algorithms in which the data points closest to a centroid in the data space are aggregated into a single cluster.
• The number of centroids is always equal to the number of clusters.
• Example 1 – Solution:
• STEP-1: Given data points {2, 4, 6, 9, 12, 16, 20, 24, 26}
• Number of clusters to design = 2
• Initially we select (or assume) two data points as centroids: {4, 12}
• We need to find the data points nearest to each of the centroids {4, 12}:
• K1 = {2, 4, 6}    K2 = {9, 12, 16, 20, 24, 26}
• This is found from the distance between the centroids and the other data points; K1 and K2 are the clusters containing the above data points.
• Now find the mean value of each cluster:
• K1: (2 + 4 + 6) / 3 = 4
• K2: (9 + 12 + 16 + 20 + 24 + 26) / 6 = 17.8 ≈ 18
• Therefore the new centroids for clustering are (4, 18); take these values and repeat the above process until the values stop changing.
• STEP-2:
• K1 = {2, 4, 6, 9}    K2 = {12, 16, 20, 24, 26}
• Mean of K1: (2 + 4 + 6 + 9) / 4 = 5.25 ≈ 5
• Mean of K2: (12 + 16 + 20 + 24 + 26) / 5 = 19.6 ≈ 20
• Therefore the new centroids for clustering are (5, 20); take these values and repeat the above process.

• STEP-3:
• K1 = {2, 4, 6, 9, 12}    K2 = {16, 20, 24, 26}
• Mean of K1: (2 + 4 + 6 + 9 + 12) / 5 = 6.6 ≈ 7
• Mean of K2: (16 + 20 + 24 + 26) / 4 = 21.5 ≈ 22
• Therefore the new centroids for clustering are (7, 22); take these values and repeat the above process.
• STEP-4:
• K1 = {2, 4, 6, 9, 12}    K2 = {16, 20, 24, 26}
• Mean of K1: (2 + 4 + 6 + 9 + 12) / 5 = 6.6 ≈ 7
• Mean of K2: (16 + 20 + 24 + 26) / 4 = 21.5 ≈ 22
• Both clusters give the same values as in the previous step, so the final mean values (centroids) are (7, 22).
• These data points are the centroid values of the new clusters; the new clusters are designed around them.
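
A short Python check of this worked example (an assumed illustration, using the same rounding of each mean as above): starting from centroids (4, 12), it should converge to (7, 22).

```python
data = [2, 4, 6, 9, 12, 16, 20, 24, 26]
c1, c2 = 4, 12                                             # initial centroids
while True:
    k1 = [x for x in data if abs(x - c1) <= abs(x - c2)]   # points nearer to c1
    k2 = [x for x in data if abs(x - c1) > abs(x - c2)]    # points nearer to c2
    m1 = round(sum(k1) / len(k1))                          # the example rounds each mean
    m2 = round(sum(k2) / len(k2))
    if (m1, m2) == (c1, c2):                               # centroids stopped changing
        break
    c1, c2 = m1, m2
print(c1, c2, k1, k2)   # 7 22 [2, 4, 6, 9, 12] [16, 20, 24, 26]
```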
• Example 2:
• Design two clusters from the following data points using the k-means clustering algorithm and the Euclidean distance formula d = √[(x − x₁)² + (y − y₁)²].
• Given data:
• S.No    X      Y
• 1       170    60
• 2       160    55
• 3       180    75
• 4       150    50
• 5       175    65
• 6       190    85
• STEP-1: First we assume two centroids for the clusters; as per the given data, we take row 1 and row 2, so k = 2.
• Initial centroids:
• C      X      Y
• C1     170    60
• C2     160    55
• STEP-2:
• Apply the Euclidean distance formula d = √[(x − x₁)² + (y − y₁)²] between row 3 and each of the initial centroids C1 and C2; then we get the new centroids for clustering.
• Applying the formula for C1: here x = 180, y = 75 and x₁ = 170, y₁ = 60
• d = √[(180 − 170)² + (75 − 60)²] = √[10² + 15²] = √[100 + 225] = √325 ≈ 18.03
• Applying the formula for C2: here x = 180, y = 75 and x₁ = 160, y₁ = 55
• d = √[(180 − 160)² + (75 − 55)²] = √[20² + 20²] = √[400 + 400] = √800 ≈ 28.28
• Based on these values we assign the row to a cluster and compute a new centroid value.
• → We get the two distances (18.03, 28.28); since 18.03 is the smaller value, row 3 joins C1, and the following averaging gives the new centroid value for clustering:
• X = (170 + 180) / 2 = 175 (first-row X value + third-row X value, divided by 2)
• Y = (60 + 75) / 2 = 67.5 (first-row Y value + third-row Y value, divided by 2)
• The step above is repeated until all rows are finished.
• The new centroids are given below:
• C      X      Y
• C1     175    67.5
• C2     160    55
• Now the clusters with these centroids contain the data points C1 = {1, 3, ...} and C2 = {2, ...}.
• STEP-3:
• Apply the Euclidean distance formula d = √[(x − x₁)² + (y − y₁)²] between row 4 and the previous centroids C1 and C2; then we get the new centroids for clustering.
• Applying the formula for C1: here x = 175, y = 67.5 and x₁ = 150, y₁ = 50
• d = √[(175 − 150)² + (67.5 − 50)²] = √[25² + 17.5²] = √[625 + 306.25] = √931.25 ≈ 30.52
• Applying the formula for C2: here x = 160, y = 55 and x₁ = 150, y₁ = 50
• d = √[(160 − 150)² + (55 − 50)²] = √[10² + 5²] = √125 ≈ 11.18

• Based on these values we assign the row to a cluster; here the second centroid value changes in the table below.
• → Of the two distances (30.52, 11.18), 11.18 is the smaller value, so row 4 joins C2, and the following averaging gives the new centroid value for clustering:
• X = (160 + 150) / 2 = 155 (previous C2 X value + fourth-row X value, divided by 2)
• Y = (55 + 50) / 2 = 52.5 (previous C2 Y value + fourth-row Y value, divided by 2)
• The new centroids are given below:
• C      X      Y
• C1     175    67.5
• C2     155    52.5
• Now the clusters with these centroids contain the data points C1 = {1, 3, ...} and C2 = {2, 4, ...}.
• STEP-4:
• Apply the Euclidean distance formula d = √[(x − x₁)² + (y − y₁)²] between row 5 and the previous centroids C1 and C2; then we get the new centroids for clustering.
• Applying the formula for C1: here x = 175, y = 65 and x₁ = 175, y₁ = 67.5
• d = √[(175 − 175)² + (65 − 67.5)²] = √[0² + 2.5²] = 2.5
• Applying the formula for C2: here x = 175, y = 65 and x₁ = 155, y₁ = 52.5
• d = √[(175 − 155)² + (65 − 52.5)²] = √[20² + 12.5²] = √[400 + 156.25] = √556.25 ≈ 23.58
• Based on these values, here the first centroid value C1 changes in the table below.
• → Of the two distances (2.5, 23.58), 2.5 is the smaller value, so row 5 joins C1, and the following averaging gives the new centroid value for clustering:
• X = (175 + 175) / 2 = 175 (previous C1 X value + fifth-row X value, divided by 2)
• Y = (67.5 + 65) / 2 = 66.25 (previous C1 Y value + fifth-row Y value, divided by 2)
• The new centroids are given below:
• C      X      Y
• C1     175    66.25
• C2     155    52.5
• Now the clusters with these centroids contain the data points C1 = {1, 3, 5, ...} and C2 = {2, 4, ...}.
• STEP-5:
• Apply the Euclidean distance formula d = √[(x − x₁)² + (y − y₁)²] between row 6 and the previous centroids C1 and C2; then we get the new centroids for clustering.
• Applying the formula for C1: here x = 190, y = 85 and x₁ = 175, y₁ = 66.25
• d = √[(190 − 175)² + (85 − 66.25)²] = √[15² + 18.75²] = √[225 + 351.5625] = √576.5625 ≈ 24.01
• Applying the formula for C2: here x = 190, y = 85 and x₁ = 155, y₁ = 52.5
• d = √[(190 − 155)² + (85 − 52.5)²] = √[35² + 32.5²] = √[1225 + 1056.25] = √2281.25 ≈ 47.76
• Based on these values, here the first centroid value C1 changes in the table below.
• → Of the two distances (24.01, 47.76), 24.01 is the smaller value, so row 6 joins C1, and the following averaging gives the new centroid value for clustering:
• X = (175 + 190) / 2 = 182.5 (previous C1 X value + sixth-row X value, divided by 2)
• Y = (66.25 + 85) / 2 = 75.625 (previous C1 Y value + sixth-row Y value, divided by 2)
• The new centroids are given below:
• C      X        Y
• C1     182.5    75.625
• C2     155      52.5
• All rows are now processed, so the final clusters are C1 = {1, 3, 5, 6} and C2 = {2, 4}.
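
A short Python sketch reproducing the sequential update used in this example (an assumed illustration): each remaining row joins the nearer centroid, and that centroid is then averaged with the new row. Note this is an online, one-pass variant; batch k-means would instead recompute each centroid as the mean of all its assigned points.

```python
import math

rows = [(170, 60), (160, 55), (180, 75), (150, 50), (175, 65), (190, 85)]
c = [list(rows[0]), list(rows[1])]   # rows 1 and 2 serve as initial centroids
clusters = [[1], [2]]

for i, (x, y) in enumerate(rows[2:], start=3):
    d = [math.dist((x, y), cj) for cj in c]          # Euclidean distance to each centroid
    j = d.index(min(d))                              # the nearer centroid wins the row
    c[j] = [(c[j][0] + x) / 2, (c[j][1] + y) / 2]    # average it with the new row
    clusters[j].append(i)

print(c)         # [[182.5, 75.625], [155.0, 52.5]]
print(clusters)  # [[1, 3, 5, 6], [2, 4]]
```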
• Data Mining Distribution Models
• These models are based on how probable it is that the data points in a cluster belong to the same distribution (e.g., Gaussian). A popular example of this model is the Expectation-Maximization algorithm.
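
A minimal sketch of this idea via scikit-learn's Gaussian mixture model, which is fitted with Expectation-Maximization (the data and component count are assumed toy values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # one Gaussian blob around (0, 0)
               rng.normal(6, 1, (50, 2))])   # another blob around (6, 6)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)          # estimated centers of the two Gaussians
print(gmm.predict(X[:3]))  # most probable component for the first points
```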

• Data Mining Density Models
• These models search for areas of varied density of data points in the data space.
• They isolate the different density regions and assign the data points within each region to the same cluster.
• Popular examples of density models are DBSCAN and OPTICS.
• Applications of Cluster Analysis
• Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base and characterize those customer groups based on purchasing patterns.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as the detection of credit card fraud.
• Clustering also helps in identifying areas of similar land use in an earth observation database, and in identifying groups of houses in a city according to house type, value, and geographic location.
