UNIT-V
Cluster Analysis
Data Mining – Cluster Analysis
Cluster analysis, also known as clustering, is a method of data mining that groups similar
data points together. The goal of cluster analysis is to divide a dataset into groups (or
clusters) such that the data points within each group are more similar to each other than to
data points in other groups.
For example, consider a dataset of vehicles containing information about different vehicles such as cars, buses, and bicycles. Because clustering is unsupervised learning, there are no class labels such as Car or Bike attached to the records; all the data is mixed together without any predefined structure, and clustering groups the similar vehicles together on its own.

Types of Clustering
Clustering techniques can be categorized under various rules and parameters, from simple similarities in data values to comparisons of the relationships between data points. One common way to categorize the techniques is shown below.
1. Partition based Clustering
2. Hierarchical Clustering
3. Density-based Clustering
We’ll briefly explain these before moving on to applications.
Partition Based Clustering
Given a database of n objects or data tuples, a partitioning method constructs k partitions of
the data, where each partition represents a cluster.

This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters that have to be generated. In the partitioning method, given a database D that contains N objects, the method constructs a user-specified number K of partitions of the data, in which each partition represents a cluster and corresponds to a particular region.
Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that
groups similar objects into groups called clusters. The endpoint is a set of clusters, where
each cluster is distinct from each other cluster, and the objects within each cluster are broadly
similar to each other.

Density-based Clustering
Density-based spatial clustering of applications with noise (DBSCAN) is a well-known
data clustering algorithm that is commonly used in data mining and machine learning.
DBSCAN groups together points that are close to each other based on a distance
measurement (usually Euclidean distance) and a minimum number of points. It also marks as
outliers the points that are in low-density regions.

Different types of Clusters

Clustering aims to discover useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Of course, there are several notions of a cluster that prove useful in practice. In order to visually illustrate the differences between these kinds of clusters, we use two-dimensional points, although the types of clusters described here are equally valid for other sorts of data.

o Well-separated cluster

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close or similar to one another. This definition of a cluster is satisfied only when the data contains natural clusters that are quite far from one another. The figure illustrates an example of well-separated clusters consisting of groups of points in a two-dimensional space. Well-separated clusters do not need to be spherical; they can have any shape.

o Prototype-Based cluster

A cluster is a set of objects in which each object is closer (more similar) to the prototype that characterizes the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, for example when the data has categorical attributes, the prototype is usually a medoid, i.e., the most representative point of the cluster. For many sorts of data, the prototype can be regarded as the most central point, and in such cases we commonly refer to prototype-based clusters as center-based clusters. As one might expect, such clusters tend to be spherical. The figure illustrates an example of center-based clusters.

o Graph-Based cluster

If the data is depicted as a graph, where the nodes are the objects, then a cluster can be described as a connected component: a group of objects that are connected to one another but have no connection to objects outside the group. An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. The figures demonstrate an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it can run into trouble when noise is present: as shown by the two circular clusters in the figure, a small bridge of points can join two distinct clusters.

Other kinds of graph-based clusters are also possible. One such approach defines a cluster as a clique, i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be spherical.

o Density-Based Cluster

A cluster is a dense region of objects that is surrounded by a region of low density. The two spherical clusters in the figure are not merged, because the bridge between them fades into the noise. Likewise, the curve present in the figure fades into the noise and does not form a cluster. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. In contrast, a contiguity-based definition of a cluster would not work well for such data, since the noise would tend to form bridges between clusters.

o Shared-Property or Conceptual Clusters



More generally, we can describe a cluster as a set of objects that share some property. The objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in the figure: a triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to recognize these clusters successfully. The process of discovering such clusters is called conceptual clustering.

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between each data point and the centroid of its corresponding cluster.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the best clusters are found. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
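
The steps above can be sketched in Python as follows. This is only a minimal illustration using NumPy on a small made-up 2-D dataset; the function name kmeans and the sample data are assumptions for demonstration, and empty clusters are not handled here (that issue is discussed in the next section).

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    # Steps 1-2: choose K and pick K random points from the data as initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each data point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: place a new centroid at the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when no centroid (and hence no assignment) changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster index of each point
print(centroids)   # final cluster centers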

What are the additional issues of K-Means Algorithm in data mining?

There are various issues with the K-Means algorithm, which are as follows −

Handling Empty Clusters − The first issue with the basic K-means algorithm described above is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, a strategy is needed to choose a replacement centroid, because otherwise the squared error will be larger than necessary.

One method is to select the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error. Another method is to select the replacement centroid from the cluster that has the largest SSE. This will generally split that cluster and decrease the overall SSE of the clustering. If there are multiple empty clusters, this process can be repeated several times.
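
A minimal Python sketch of the first strategy is shown below; it assumes X is the data matrix and centroids the current centroid array from a K-means run such as the sketch above, and the function name and variables are illustrative only.

import numpy as np

def replace_empty_centroid(X, centroids, empty_j):
    # Distance from every point to its nearest current centroid
    nearest = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)
    # The point farthest from any current centroid contributes most to the total
    # squared error, so reusing it as the replacement centroid removes that contribution
    centroids[empty_j] = X[nearest.argmax()]
    return centroids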

Outliers − When the squared error criterion is used, outliers can unduly influence the clusters that are discovered. In particular, when outliers are present, the resulting cluster centroids (prototypes) may not be as representative as they otherwise would be, and the SSE will be higher as well.

It is often beneficial to find outliers and remove them beforehand. However, it is essential to appreciate that there are certain clustering applications for which outliers should not be removed. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.

Reducing the SSE with Postprocessing − An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. In many cases, however, we would like to improve the SSE without increasing the number of clusters. This is often possible because K-means typically converges only to a local minimum.

Various methods are used to "fix up" the resulting clusters in order to produce a clustering that has a lower SSE. The strategy is to focus on individual clusters, because the total SSE is simply the sum of the SSE contributed by each cluster. We can change the total SSE by performing various operations on the clusters, such as splitting or merging clusters.
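
Because the total SSE is just the sum of the per-cluster SSEs, a natural first step in such postprocessing is to compute the SSE of each cluster and pick the worst one, for example as a candidate for splitting. A small NumPy sketch, with illustrative variable names, might look like this:

import numpy as np

def per_cluster_sse(X, labels, centroids):
    # SSE of cluster j = sum of squared distances from its members to its centroid
    return np.array([((X[labels == j] - centroids[j]) ** 2).sum()
                     for j in range(len(centroids))])

# The cluster with the largest SSE is the natural candidate for splitting,
# e.g., by running 2-means on its members (as bisecting K-means does below).
# worst = per_cluster_sse(X, labels, centroids).argmax()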

What is Bisecting K-Means?

The bisecting K-means algorithm is a simple extension of the basic K-means algorithm that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

The initial values for the means are assigned arbitrarily. They can be assigned randomly, or the values of the first k input items themselves can be used. The convergence criterion can be based on the squared error, but it need not be; for instance, the algorithm can terminate when no objects are reassigned to different clusters. Other termination conditions stop after a fixed number of iterations; a maximum number of iterations can be imposed to guarantee stopping even without convergence.

The algorithm for bisecting K-means is as follows (a Python sketch is given after the list) −

o Initialize the list of clusters to contain one cluster consisting of all points.
o repeat
o Remove a cluster from the list of clusters.
o {Perform several "trial" bisections of the selected cluster.}
o for i = 1 to number of trials do
o Bisect the selected cluster using basic K-means.
o end for
o Choose the two clusters from the bisection with the smallest total SSE.
o Add these two clusters to the list of clusters.
o until the list of clusters contains K clusters.
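
The following is a rough Python sketch of this procedure using scikit-learn's KMeans for the individual bisections. It is only an illustration: the choice to always bisect the cluster with the largest SSE, the trial count, and the variable names are assumptions rather than part of the algorithm statement above.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, n_trials=5, seed=0):
    clusters = [np.arange(len(X))]                 # one cluster containing all points
    while len(clusters) < K:
        # Remove a cluster from the list; here we pick the one with the largest SSE
        sses = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        # Perform several "trial" bisections and keep the one with the smallest SSE
        best_labels, best_sse = None, np.inf
        for trial in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(X[idx])
            if km.inertia_ < best_sse:
                best_sse, best_labels = km.inertia_, km.labels_
        # Add the two resulting clusters to the list
        clusters.append(idx[best_labels == 0])
        clusters.append(idx[best_labels == 1])
    return clusters                                # list of index arrays, one per cluster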

Hierarchical clustering in data mining

Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters based on previously defined clusters. It works by grouping data into a tree of clusters. Hierarchical clustering starts by treating each data point as an individual cluster. The endpoint is a set of clusters, where each cluster is distinct from the other clusters, and the objects within each cluster are broadly similar to one another.

There are two types of hierarchical clustering:

o Agglomerative Hierarchical Clustering
o Divisive Clustering

Agglomerative hierarchical clustering

Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters. Agglomerative clustering is also known as AGNES (Agglomerative Nesting). In agglomerative clustering, each data point initially acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner: at each iteration, clusters are merged with other clusters until only one cluster remains.

Agglomerative hierarchical clustering algorithm

1. Consider each data point as an individual cluster.
2. Determine the similarity between each cluster and all other clusters (i.e., compute the proximity matrix).
3. Merge the most similar (closest) clusters.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat step 3 and step 4 until only a single cluster remains.

Let’s understand this concept with the help of graphical representation using a dendrogram.

With the help of the given demonstration, we can understand how the actual algorithm works. Note that no calculations have been done below; all the proximities among the clusters are assumed.

Let's suppose we have six different data points P, Q, R, S, T, V.

Step 1:

Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance
between the individual cluster from all other clusters.

Step 2:

Now, merge the most similar clusters into single clusters. Let's say cluster Q and cluster R are similar to each other, so we merge them in this step; similarly, S and T are merged. We then have the clusters [(P), (QR), (ST), (V)].

Step 3:

Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST) and (V)] together to form the new set of clusters [(P), (QR), (STV)].

Step 4:

Repeat the same process. The clusters (QR) and (STV) are now the closest and are combined together to form a new cluster. We then have [(P), (QRSTV)].

Step 5:

Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
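
The same bottom-up merging can be reproduced with SciPy, which also draws the dendrogram. The six 2-D coordinates below are made up purely to stand in for the points P, Q, R, S, T, V, and the linkage method ("single", i.e., closest-pair distance) is likewise just one possible choice.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six illustrative 2-D points standing in for P, Q, R, S, T, V
names = ["P", "Q", "R", "S", "T", "V"]
X = np.array([[0.0, 0.0], [4.0, 4.0], [4.2, 3.9],
              [9.0, 9.0], [9.3, 8.8], [12.0, 12.0]])

# Agglomerative (bottom-up) clustering: each point starts as its own cluster
# and the two closest clusters are merged at every step
Z = linkage(X, method="single")

dendrogram(Z, labels=names)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()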

Density-based clustering in data mining

Density-based clustering refers to a method that is based on a local cluster criterion, such as density-connected points. In this section, we will discuss density-based clustering with examples.

What is Density-based clustering?

Density-based clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Data points in the low-density regions separating clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the object is called a core object.

Density-Based Clustering - Background

There are two parameters that define density-based clustering:

Eps: the maximum radius of the neighborhood.

MinPts: the minimum number of points in the Eps-neighborhood of a point.

The Eps-neighborhood of a point i is then NEps(i) = { k ∈ D : dist(i, k) ≤ Eps }.

Directly density reachable:

A point i is directly density reachable from a point k with respect to Eps and MinPts if

i belongs to NEps(k)

and k satisfies the core point condition:

|NEps(k)| >= MinPts

Density reachable:

A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points p1, ..., pn with p1 = j and pn = i such that p(m+1) is directly density reachable from p(m) for every m.

Density connected:

A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density reachable from o with respect to Eps and MinPts.
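
These definitions translate directly into code. The small NumPy sketch below computes the Eps-neighborhood of a point and checks the core point condition; the dataset D and the parameter values are illustrative only.

import numpy as np

def eps_neighborhood(D, i, eps):
    # NEps(i) = { k in D : dist(i, k) <= Eps }, using Euclidean distance
    dists = np.linalg.norm(D - D[i], axis=1)
    return np.where(dists <= eps)[0]

def is_core_point(D, i, eps, min_pts):
    # Core point condition: |NEps(i)| >= MinPts
    return len(eps_neighborhood(D, i, eps)) >= min_pts

# Illustrative check on a tiny 2-D dataset
D = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [5.0, 5.0]])
print(is_core_point(D, 0, eps=1.0, min_pts=3))   # True: three points lie within eps of point 0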

DBSCAN
Clustering analysis, or simply clustering, is basically an unsupervised learning method that divides the data points into a number of specific batches or groups, such that the data points in the same group have similar properties and data points in different groups have dissimilar properties in some sense. Clustering comprises many different methods based on different notions of a cluster; DBSCAN is a density-based method.

In this algorithm, we have 3 types of data points:

Core Point: a point is a core point if it has at least MinPts points within distance eps.
Border Point: a point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: a point that is neither a core point nor a border point.

Steps Used In DBSCAN Algorithm

1. Find all the neighbor points within eps and identify the core points, i.e., the points with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
   Points a and b are said to be density connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a through the chain.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise. (A short usage example is given after these steps.)
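
In practice these steps are usually run through a library implementation. A minimal example with scikit-learn's DBSCAN is sketched below; the data, eps, and min_samples values are made up for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be labelled as noise
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.0],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.0],
              [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # a cluster index for each point; -1 marks noise/outliers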
