
MODULE 5

Q. No. Questions and Answers Marks

1 Briefly explain different types of clusters.

Types of Clusters
 • Well-separated clusters
 • Center-based (prototype-based) clusters
 • Contiguous (graph-based) clusters
 • Density-based clusters
 • Shared-property (conceptual) clusters

Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point
in the cluster than to any point not in the cluster.

Center-based (prototype-based)

– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its
cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the
most “representative” point of the cluster.
Contiguous Clusters (Graph-based)

– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other
points in the cluster than to any point not in the cluster.

Density-based

– A cluster is a dense region of points, which is separated by low-density regions, from other regions of
high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

Shared Property or Conceptual Clusters

– Finds clusters that share some common property or represent a particular concept.
2 Define clustering. Explain different types of clustering. 8

Clustering is the process of grouping data objects in such a way that objects within a group are similar (or
related) to one another and different from (or unrelated to) the objects in other groups.

Various types of clusterings:


a. Hierarchical (nested) versus partitional (unnested)
b. Exclusive versus overlapping versus fuzzy, and
c. Complete versus partial.

Hierarchical (nested) versus partitional (unnested)


 • A partitional clustering is simply a division of the set of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset.
 • A hierarchical clustering is a set of nested clusters that are organized as a tree. Each node
(cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root
of the tree is the cluster containing all the objects.

Exclusive versus Overlapping versus Fuzzy

 • In exclusive clustering, each object is assigned to a single cluster. There are many situations in which a
point could reasonably be placed in more than one cluster, and these situations are better addressed
by non-exclusive clustering.
 • In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an
object can simultaneously belong to more than one group (class). For instance, a person at a
university can be both an enrolled student and an employee of the university. A non-exclusive
clustering is also often used when, for example, an object is "between" two or more clusters and
could reasonably be assigned to any of these clusters.
 • In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between
0 (absolutely doesn't belong) and 1 (absolutely belongs). In other words, clusters are treated as fuzzy
sets.

Complete versus Partial


 • A complete clustering assigns every object to a cluster, whereas a partial clustering does not. The
motivation for a partial clustering is that some objects in a data set may not belong to well-defined
groups. Many times, objects in the data set may represent noise, outliers, or "uninteresting
background."
3 Explain k-means clustering method with suitable example. 8

 • Partitional clustering approach.

 • This method can only be used if the data objects are located in main memory.
 • The method is called K-means since each of the K clusters is represented by the mean of the objects
(called the centroid) within it.
 • The method is also called the centroid method since, at each step, the centroid point of each cluster is
assumed to be known and each of the remaining points is allocated to the cluster whose centroid is
closest to it.

The algorithm is as follows:


1. Select the number of clusters = k (Figure 4.1c).
2. Pick k seeds as centroids of the k clusters. The seeds may be picked randomly unless the user
has some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to the cluster it is nearest to (based on the distances computed in the
previous step).
5. Compute the centroids of the clusters by computing the means of the attribute values of the
objects in each cluster.
6. Check if the stopping criterion has been met (e.g., the cluster membership is unchanged). If yes,
go to step 7. If not, go to step 3.
7. One may decide to stop at this stage or to split a cluster or combine two clusters until a stopping
criterion is met.
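
A minimal NumPy sketch of these steps is given below (illustrative only: the function name, the random
seeding, and the stopping test on unchanged membership are assumptions, and empty clusters are not handled).

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X is an (n, d) array of objects; k is the chosen number of clusters (step 1)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 2: pick k random seeds
    labels = None
    for _ in range(max_iter):
        # step 3: Euclidean distance of every object from every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                         # step 4: allocate to nearest centroid
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                 # step 6: membership unchanged, stop
        labels = new_labels
        # step 5: recompute each centroid as the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = k_means(np.random.rand(200, 2), k=3)          # example call on random 2-D data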
4 Explain the evaluation of K-means clusters.

The most common measure is the Sum of Squared Error (SSE).


For each point, the error is the distance to the nearest cluster (its centroid).
To get the SSE, we square these errors and sum them:

SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(mi, x)²

where x is a data point in cluster Ci and mi is the representative point for cluster Ci; it can be shown that mi
corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smaller error.
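
A short sketch of this computation (assuming the labels and centroids returned by a K-means run, as in
the previous sketch; the names are illustrative):

import numpy as np

def sse(X, labels, centroids):
    # distance from each point to its own cluster centroid, squared and summed
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))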

One easy way to reduce SSE is to increase K, the number of clusters.


A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
Problems with Selecting Initial Points
If there are K 'real' clusters, then the chance of selecting one initial centroid from each cluster is small,
and the chance becomes smaller as K grows.
If the clusters are all of the same size, n, then

P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids)
  = (K! · n^K) / (K·n)^K = K! / K^K

For example, if K = 10, then the probability = 10!/10^10 ≈ 0.00036.

Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they will not.

5 Explain the bisecting K-means algorithm

The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm that is based
on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to
split, and so on, until K clusters have been produced.
There are a number of different ways to choose which cluster to split. We can choose the largest cluster at
each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different
choices result in different clusters.

Example: To illustrate that bisecting K-means is less susceptible to initialization problems, Figure 8.8 shows
how bisecting K-means finds four clusters in the data set originally shown in Figure 8.6(a). In iteration
1, two pairs of clusters are found; in iteration 2, the rightmost pair of clusters is split; and in iteration 3, the
leftmost pair of clusters is split. Bisecting K-means has less trouble with initialization because it performs
several trial bisections and takes the one with the lowest SSE, and because there are only two centroids at
each step.
Finally, by recording the sequence of clusterings produced as K-means bisects clusters, we can also use
bisecting K-means to produce a hierarchical clustering.
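
A hedged sketch of the procedure, assuming the k_means and sse helpers sketched earlier are available;
here the cluster with the largest SSE is chosen for splitting, and the best of several trial bisections is kept:

import numpy as np

def cluster_sse(points):
    # SSE of a single cluster around its own centroid
    return float(np.sum((points - points.mean(axis=0)) ** 2))

def bisecting_k_means(X, k, trials=5, seed=0):
    clusters = [np.arange(len(X))]                     # each cluster is an array of point indices
    while len(clusters) < k:
        clusters.sort(key=lambda idx: cluster_sse(X[idx]))
        idx = clusters.pop()                           # split the cluster with the largest SSE
        best = None
        for t in range(trials):                        # several trial bisections; keep the lowest SSE
            labels, cents = k_means(X[idx], 2, seed=seed + t)
            total = sse(X[idx], labels, cents)
            if best is None or total < best[0]:
                best = (total, labels)
        labels = best[1]
        clusters.extend([idx[labels == 0], idx[labels == 1]])
    return clusters                                    # list of index arrays, one per cluster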

6
7 Discuss the hierarchical clustering method in detail. 10

 • Agglomerative clustering methods start with the points as individual clusters and, at each step, merge the
closest pair of clusters. This requires defining a notion of cluster proximity.
 • A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram,
which displays both the cluster-subcluster relationships and the order in which the clusters were merged
(agglomerative view) or split (divisive view).
 • For sets of two-dimensional points, such as those that we will use as examples, a hierarchical clustering
can also be graphically represented using a nested cluster diagram. Figure 8.13 shows an example of
these two types of figures for a set of four two-dimensional points.

Basic Agglomerative Hierarchical Clustering Algorithm

Many agglomerative hierarchical clustering techniques are variations on a single approach: starting with
individual points as clusters, successively merge the two closest clusters until only one cluster remains.
How to Define Inter-Cluster Similarity (Proximity of two clusters):

1. MIN
2. MAX
3. Group Average

1) Single Link or MIN


For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the
minimum of the distance (maximum of the similarity) between any two points in the two different clusters.

2) Complete Link or MAX or CLIQUE


For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as
the maximum of the distance (minimum of the similarity) between any two points in the two different
clusters.

3) Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the
average pairwise proximity among all pairs of points in the different clusters.
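
For illustration, a short SciPy-based sketch (assuming SciPy is available) that builds the dendrogram with
each of the three proximity definitions and cuts it into a fixed number of flat clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                              # toy set of two-dimensional points

for method in ("single", "complete", "average"):       # MIN, MAX, group average
    Z = linkage(X, method=method)                      # merge history, i.e. the dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, labels)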

8 What are the strengths and weaknesses of K-means? 10

 • K-means is simple and can be used for a wide variety of data types.
 • It is also quite efficient, even though multiple runs are often performed.
 • K-means is not suitable for all types of data.
 • K-means has problems when clusters are of differing
o Sizes
o Densities
o Non-globular shapes
 • K-means has problems when the data contains outliers.

9 How are density-based methods used for clustering? Explain with an example. 8

 • DBSCAN is a density-based algorithm.

 • In the center-based approach, density is estimated for a particular point in the data set by counting
the number of points within a specified radius, Eps, of that point.
- Density = number of points within a specified radius (Eps)

Classification of points

 • A point is a core point if it has more than a specified number of points (MinPts) within Eps.
These are points that are in the interior of a cluster.
 • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
 • A noise point is any point that is not a core point or a border point.

DBSCAN Algorithm
Given the previous definitions of core points, border points, and noise points, the DBSCAN algorithm can be
informally described as follows. Any two core points that are close enough (within a distance Eps of one
another) are put in the same cluster. Likewise, any border point that is close enough to a core point is put in
the same cluster as the core point. (Ties may need to be resolved if a border point is close to core points from
different clusters.) Noise points are discarded. The formal details are given in Algorithm 8.4.
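
A compact O(n²) sketch of this informal description (my own illustrative code; the neighbor count here
includes the point itself, and ties for border points go to the first cluster that reaches them). In practice a
library implementation such as scikit-learn's DBSCAN would normally be used.

import numpy as np

def dbscan_sketch(X, eps, min_pts):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]   # Eps-neighborhood of each point
    core = np.array([len(nb) >= min_pts for nb in neighbors])        # core points
    labels = np.full(n, -1)                                          # -1 marks noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster                    # start a new cluster from an unlabeled core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # core and border points join the cluster
                if core[j]:
                    queue.extend(neighbors[j]) # only core points expand the cluster further
        cluster += 1
    return labels                              # cluster ids, with -1 for noise points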
10 How are the parameters selected in the DBSCAN algorithm? 4

 • The issue in the DBSCAN algorithm is how to determine the parameters Eps and MinPts. The basic
approach is to look at the behavior of the distance from a point to its kth nearest neighbor, which we
call the k-dist.
 • For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster
size. Note that there will be some variation, depending on the density of the cluster and the random
distribution of points, but on average, the range of variation will not be huge if the cluster densities are
not radically different.
 • However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large.
Therefore, if we compute the k-dist for all the data points for some k, sort them in increasing order, and
then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a
suitable value of Eps.
 • If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points
for which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise
or border points.
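
A small sketch of the k-dist computation described above (plotting the returned values and looking for the
'knee' suggests Eps, with k used as MinPts):

import numpy as np

def sorted_k_dist(X, k):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kth = np.sort(dist, axis=1)[:, k]      # column 0 is the zero distance to the point itself
    return np.sort(kth)                    # sorted distance of every point to its kth nearest neighbor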
11 Explain the key issues in hierarchical clustering (strengths and weaknesses). 8

 • Once a decision is made to combine two clusters, it cannot be undone.
 • No objective function is directly minimized.
 • Different schemes have problems with one or more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex shapes
– Breaking large clusters

12 Briefly explain the strengths and weaknesses of DBSCAN algorithm 10

o It is relatively resistant to noise.
o It can handle clusters of different shapes and sizes.
o It does not work well when the clusters have varying densities.
o It does not work well with high-dimensional data.
13 Briefly explain DENCLUE clustering method 4

o DENCLUE: A Kernel-Based Scheme for Density-Based Clustering


o DENCLUE (DENsity ClUstEring) is a density-based clustering approach that models the overall density
of a set of points as the sum of influence functions associated with each point.
o The resulting overall density function will have local peaks, i.e., local density maxima, and these local
peaks can be used to define clusters in a natural way. Specifically, for each data point, a hill climbing
procedure finds the nearest peak associated with that point, and the set of all data points associated with a
particular peak (called a local density attractor) becomes a cluster.
Example: Figure 9.13 shows a possible density function for a one-dimensional data set. Points A-E are the
peaks of this density function and represent local density attractors. The dotted vertical lines delineate local
regions of influence for the local density attractors. Points in these regions will become center-defined
clusters. The dashed horizontal line shows a density threshold, ξ. All points associated with a local density
attractor whose density is less than ξ, such as those associated with C, will be discarded. All other clusters
are kept. Note that a kept cluster can include points whose density is less than ξ, as long as they are associated
with a local density attractor whose density is greater than ξ. Finally, clusters that are connected by a path of
points with a density above ξ are combined. Clusters A and B would remain separate, while clusters D and E
would be combined.
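
A rough sketch of the idea, assuming a Gaussian influence function (a common choice for DENCLUE) and a
very simple gradient-ascent hill climb; the function names, step size, and stopping test are my own:

import numpy as np

def overall_density(x, data, sigma=1.0):
    # sum of Gaussian influence functions centred at every data point
    sq = np.sum((data - x) ** 2, axis=1)
    return float(np.sum(np.exp(-sq / (2.0 * sigma ** 2))))

def hill_climb(x, data, sigma=1.0, step=0.05, iters=500):
    # move x uphill until it (approximately) reaches its local density attractor
    for _ in range(iters):
        diff = data - x
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma ** 2))
        grad = (w[:, None] * diff).sum(axis=0)           # direction of steepest density increase
        if np.linalg.norm(grad) < 1e-6:
            break
        x = x + step * grad / np.linalg.norm(grad)
    return x                                             # points sharing an attractor form a cluster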

14 Explain the following density based clustering techniques. 6


i. CLIQUE
ii. DENCLUE

i) CLIQUE
CLIQUE (Clustering In QUEst) is a grid-based clustering algorithm that methodically finds subspace
clusters. It is impractical to check each subspace for clusters since the number of such subspaces is
exponential in the number of dimensions. Instead, CLIQUE relies on the following property:
Monotonicity property of density-based clusters: If a set of points forms a density-based cluster in k
dimensions (attributes), then the same set of points is also part of a density-based cluster in all
possible subsets of those dimensions.

ii) DENCLUE
DENCLUE (DENsity ClUstEring) is a density-based clustering approach that models the overall density
of a set of points as the sum of influence functions associated with each point. The resulting overall
density function will have local peaks, i.e., local density maxima, and these local peaks can be used to
define clusters in a natural way. Specifically, for each data point, a hill climbing procedure finds the
nearest peak associated with that point, and the set of all data points associated with a particular peak
(called a local density attractor) becomes a cluster.

15 Explain with an algorithm MST clustering. 6

 • MST clustering is a divisive hierarchical technique: it starts with the minimum spanning tree of the
proximity graph and can be viewed as an application of sparsification for finding clusters.
 • A minimum spanning tree of a graph is a subgraph that (1) has no cycles, i.e., is a tree, (2) contains all the
nodes of the graph, and (3) has the minimum total edge weight of all possible spanning trees. (The name
minimum spanning tree assumes that we are working only with dissimilarities or distances.) An example of a
minimum spanning tree for some two-dimensional points is shown in Figure 9.16.
 • The MST divisive hierarchical algorithm is shown in Algorithm 9.7:
1: Compute a minimum spanning tree of the dissimilarity graph.
2: repeat
3:    Create a new cluster by breaking the link corresponding to the largest dissimilarity.
4: until only singleton clusters remain
The first step is to find the MST of the original dissimilarity graph. Note that a minimum spanning tree can be
viewed as a special type of sparsified graph.
 • Step 3 (breaking a link) can also be viewed as graph sparsification. Hence, MST can be viewed as a
clustering algorithm based on the sparsification of the dissimilarity graph.
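
A hedged SciPy-based sketch of the same idea (assuming SciPy is available); instead of splitting all the way
down to singletons, it simply cuts the (n_clusters - 1) largest MST edges and returns the connected components:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clustering(X, n_clusters):
    D = squareform(pdist(X))                          # full pairwise dissimilarity matrix
    mst = minimum_spanning_tree(D).toarray()          # nonzero entries are the MST edges
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    for i, j in edges[np.argsort(weights)[::-1][: n_clusters - 1]]:
        mst[i, j] = 0                                 # break the largest-dissimilarity links
    _, labels = connected_components(mst, directed=False)
    return labels                                     # one cluster label per point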

16 Explain chameleon algorithm. 6

 • CHAMELEON: hierarchical clustering using dynamic modeling.

 • It measures the similarity based on a dynamic model.
 • Two clusters are merged only if the interconnectivity and closeness (proximity) between the two
clusters are high relative to the internal interconnectivity of the clusters and the closeness of
items within the clusters.
 • It is a two-phase algorithm:
1. Use a graph partitioning algorithm to cluster objects into a large number of relatively small
sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by
repeatedly combining these sub-clusters.

Deciding Which Clusters to Merge

 • Relative Closeness (RC) is the absolute closeness of two clusters normalized by the internal
closeness of the clusters.

 • Relative Interconnectivity (RI) is the absolute interconnectivity of two clusters normalized by the
internal connectivity of the clusters. (The usual formulas are sketched below.)
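
In symbols, the definitions usually given (e.g., in Tan, Steinbach and Kumar) are the following, where
EC(Ci, Cj) is the edge cut between clusters Ci and Cj, EC(Ci) is the weight of the min-cut bisector of Ci,
S̄_EC denotes the corresponding average edge weights, and mi, mj are the cluster sizes; treat these as a
hedged reconstruction rather than part of the original answer:

\[ RI(C_i, C_j) = \frac{EC(C_i, C_j)}{\tfrac{1}{2}\,\big(EC(C_i) + EC(C_j)\big)} \]

\[ RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\dfrac{m_i}{m_i + m_j}\,\bar{S}_{EC}(C_i) + \dfrac{m_j}{m_i + m_j}\,\bar{S}_{EC}(C_j)} \]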

CHAMELEON Algorithm: sparsify the proximity graph (a k-nearest-neighbor graph), partition it into many
small sub-clusters, then repeatedly merge the pair of sub-clusters whose relative interconnectivity and relative
closeness are both high, until no further merges are possible or the desired number of clusters remains.

17 Write BIRCH algorithm for clustering and explain. 6

BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.

 • It plays a significant role in the scalable clustering of very large data sets.

 • The data are summarized by compact averages (clustering features) rather than by individual points.

 • It is designed around two primary structures:

1. Clustering Feature: CF = <N, LS, SS>, where N is the number of points in a cluster, LS is the linear sum
of the points, and SS is the squared sum of the points.
2. CF tree: a height-balanced tree of clustering features.

 • Internal nodes store the sums (CFs) of their descendants.
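
A small sketch of a clustering feature and its additivity (SS is kept here as a single scalar sum of squared
norms, which is one common convention; the class and method names are illustrative):

import numpy as np

class ClusteringFeature:
    def __init__(self, d):
        self.n = 0                     # N: number of points
        self.ls = np.zeros(d)          # LS: linear sum of the points
        self.ss = 0.0                  # SS: sum of squared norms of the points

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += float(np.dot(x, x))

    def merge(self, other):
        # CF additivity: a parent node's CF is the sum of its children's CFs
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # average distance of the points from the centroid, computed from the CF alone
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0)))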


18 Explain Graph-Based Clustering method

Graph-Based clustering uses the proximity graph


• Start with the proximity matrix
• Consider each point as a node in a graph
• Each edge between two nodes has a weight which is the proximity between the two points
• Initially the proximity graph is fully connected
• MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are connected components in the graph.

Sparsification

The amount of data that needs to be processed is drastically reduced

o Sparsification can eliminate more than 99% of the entries in a proximity matrix
o The amount of time required to cluster the data is drastically reduced
o The size of the problems that can be handled is increased.

Clustering may work better


o Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point
while breaking the connections to less similar points.
o The nearest neighbors of a point tend to belong to the same class as the point itself.

This reduces the impact of noise and outliers and sharpens the distinction between clusters

Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning),
e.g., Chameleon and hypergraph-based clustering.
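
A short sketch of the simplest graph-based scheme described above: sparsify the proximity graph to each
point's k nearest neighbors and take connected components as clusters (SciPy assumed; names are illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def knn_graph_clusters(X, k=5):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = np.argsort(dist, axis=1)[:, :k]                   # keep only the k nearest neighbors
    rows = np.repeat(np.arange(len(X)), k)
    adj = csr_matrix((np.ones(rows.size), (rows, nn.ravel())), shape=(len(X), len(X)))
    _, labels = connected_components(adj, directed=False)  # connected components = clusters
    return labels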

19 Write CURE algorithm for clustering and explain. 6

 • It can detect clusters with non-spherical shapes and variable sizes.

 • It is robust in the presence of outliers.
 • It works efficiently with large datasets.

CURE Architecture:
Procedure

 • Select a target number of representative points, c.

 • Choose c well-scattered points in each cluster.
 • These scattered points are shrunk toward the cluster centroid.
 • These points are then used as representatives of the clusters in a 'dmin' (closest representative pair)
cluster-merging approach.
 • After every merge, new scattered points are chosen to represent the new cluster.
 • Cluster merging stops when the target number of clusters, k, is reached.
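
A hedged sketch of the representative-point step and the 'dmin' cluster distance (the farthest-point selection
and the shrink factor alpha follow the usual description of CURE; the code and names are illustrative):

import numpy as np

def cure_representatives(points, c=5, alpha=0.2):
    # pick c well-scattered points of a cluster, then shrink them toward the centroid
    centroid = points.mean(axis=0)
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        d = np.linalg.norm(points[:, None, :] - np.array(reps)[None, :, :], axis=2).min(axis=1)
        reps.append(points[np.argmax(d)])        # next point: farthest from the chosen representatives
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)      # shrink toward the centroid by the factor alpha

def dmin(reps_a, reps_b):
    # distance between two clusters: closest pair of representative points
    return float(np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2).min())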
19 Explain SNN Clustering Algorithm 6

1) Compute the similarity matrix.

This corresponds to a similarity graph with data points for nodes and edges whose weights are the
similarities between data points.

2) Sparsify the similarity matrix by keeping only the k most similar neighbors.
This corresponds to keeping only the k strongest links of the similarity graph.

3) Construct the shared nearest neighbor (SNN) graph from the sparsified similarity matrix.
At this point, we could apply a similarity threshold and find the connected components to obtain the
clusters (the Jarvis-Patrick algorithm).

4) Find the SNN density of each point.

Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or
greater to each point. This is the SNN density of the point.

5) Find the core points.

Using a user-specified parameter, MinPts, find the core points, i.e., all points that have an SNN density
greater than MinPts.

6) Form clusters from the core points.

If two core points are within a radius, Eps, of each other, they are placed in the same cluster.

7) Discard all noise points.
All non-core points that are not within a radius of Eps of a core point are discarded.

8) Assign all non-noise, non-core points to clusters.

This can be done by assigning such points to the nearest core point.
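
A compact sketch that follows the eight steps above (my own illustrative code; the SNN similarity here is
computed only between a point and its k nearest neighbors, and the parameter values are arbitrary):

import numpy as np

def snn_clusters(X, k=7, eps=3, min_pts=5):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    knn = [set(np.argsort(dist[i])[:k]) for i in range(n)]       # steps 1-2: k nearest neighbors
    snn = np.zeros((n, n), dtype=int)                             # step 3: shared-neighbor counts
    for i in range(n):
        for j in knn[i]:
            snn[i, j] = snn[j, i] = len(knn[i] & knn[j])
    density = (snn >= eps).sum(axis=1)                            # step 4: SNN density
    core = density >= min_pts                                     # step 5: core points
    labels = np.full(n, -1)
    cluster = 0
    for i in np.flatnonzero(core):                                # step 6: connect close core points
        if labels[i] == -1:
            labels[i] = cluster
            stack = [i]
            while stack:
                p = stack.pop()
                for q in np.flatnonzero(core & (snn[p] >= eps) & (labels == -1)):
                    labels[q] = cluster
                    stack.append(q)
            cluster += 1
    # steps 7-8: attach non-core points near a core point; the rest stay noise (-1)
    for i in np.flatnonzero(~core):
        near = np.flatnonzero(core & (snn[i] >= eps))
        if len(near):
            labels[i] = labels[near[np.argmax(snn[i, near])]]
    return labels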
20 Briefly explain basic grid based clustering algorithm. 6

The idea is to split the possible values of each attribute into a number of contiguous intervals, creating a set of
grid cells.

Objects can be assigned to grid cells in one pass through the data, and information about each cell, such as
the number of points in the cell, can also be gathered at the same time.

Defining Grid Cells: This is a key step in the process, but also the least well defined, as there are many ways
to split the possible values of each attribute into a number of contiguous intervals.

For continuous attributes, one common approach is to split the values into equal-width intervals. If this
approach is applied to each attribute, then the resulting grid cells all have the same volume, and the density of
a cell is conveniently defined as the number of points in the cell.

The Density of Grid Cells: A natural way to define the density of a grid cell (or a more generally shaped
region) is as the number of points divided by the volume of the region. In other words, density is the number
of points per amount of space, regardless of the dimensionality of that space.

Example: Figure 9.10 shows two sets of two-dimensional points divided into 49 cells using a 7-by-7 grid.
The first set contains 200 points generated from a uniform distribution over a circle centered at (2, 3) of
radius 2, while the second set has 100 points generated from a uniform distribution over a circle centered at
(6, 3) of radius 1. The counts for the grid cells are shown in Table 9.2.
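
A one-pass sketch of assigning points to equal-width grid cells and computing cell densities (the interval
count and the names are illustrative; constant-valued attributes are not handled):

import numpy as np
from collections import Counter

def grid_cell_density(X, n_intervals=7):
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_intervals                    # equal-width intervals per attribute
    cells = np.minimum(((X - lo) / width).astype(int), n_intervals - 1)
    counts = Counter(map(tuple, cells))                # number of points per occupied cell
    volume = float(np.prod(width))                     # every cell has the same volume
    return {cell: cnt / volume for cell, cnt in counts.items()}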
21 What are the various issues considered for cluster validation? Explain different evaluation measures
used for cluster validity

Different Aspects of Cluster Validation


1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random
structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class
labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external
information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just
individual clusters.

Evaluation measures used for cluster validity


 • Numerical measures that are applied to judge various aspects of cluster validity are classified into
the following three types:
 • Supervised (external indices): used to measure the extent to which cluster labels match
externally supplied class labels.
Ex: Entropy (a small sketch of its computation follows this list)
 • Unsupervised (internal indices): used to measure the goodness of a clustering structure
without respect to external information.
Ex: Sum of Squared Error (SSE)
- These are further divided into two classes: cluster cohesion and cluster separation.
 • Relative index: used to compare two different clusterings or clusters.
Often a supervised or unsupervised measure is used for this purpose, e.g., SSE or entropy.
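
A small sketch of the entropy measure mentioned above (supervised/external): the entropy of the class labels
within each cluster, weighted by cluster size; a value of 0 means every cluster contains a single class.

import numpy as np
from collections import Counter

def clustering_entropy(cluster_labels, class_labels):
    n = len(cluster_labels)
    total = 0.0
    for cluster in set(cluster_labels):
        members = [c for l, c in zip(cluster_labels, class_labels) if l == cluster]
        p = np.array(list(Counter(members).values()), dtype=float) / len(members)
        total += (len(members) / n) * float(-np.sum(p * np.log2(p)))   # weighted cluster entropy
    return total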
