
Data Mining Unit-4

Lecture Notes
---------------------------------------------------------------------------------------------------------------
Clustering and Applications: Cluster Analysis – Types of data in cluster analysis –
Categorization of Major Clustering Methods – Partitioning Methods, Hierarchical Methods-
Density based Methods, Grid based Methods, Outlier Analysis.

Topic 1: Cluster Analysis

What is Cluster Analysis?

 Cluster analysis or simply clustering is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.
 Clustering is an unsupervised machine-learning algorithm that groups data points into clusters so that similar objects belong to the same group.

 Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from a customer base, say, is divided into clusters, we can make an informed decision about which customers are best suited for a given product.

 Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
 In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics.

 Moreover, consider a consultant company with a large number of projects. To improve project management, clustering can be applied to partition projects into categories based on similarity so that project auditing and diagnosis (to improve project delivery and outcomes) can be conducted effectively.
 In image recognition, clustering can be used to discover clusters or “subclasses” in handwritten character recognition systems. Suppose we have a data set of handwritten digits, where each digit is labelled as 1, 2, 3, and so on. Note that there can be a large variance in the way in which people write the same digit. Take the number 2, for example. Some people may write it with a small circle at the bottom left part, while some others may not. We can use clustering to determine subclasses for “2,” each of which represents a variation on the way in which 2 can be written. Using multiple models based on the subclasses can improve overall recognition accuracy.
 In Web search, clustering can be used to organize the search results into groups and present the results in a concise and easily accessible way.
 Applications of cluster analysis in data mining:
 In many applications, clustering analysis is widely used, such as data analysis,
market research, pattern recognition, and image processing.
 It assists marketers in finding distinct groups in their customer base; based on purchasing patterns, they can characterize their customer groups.
 It helps in classifying documents on the Web for information discovery.
 Clustering is also used in outlier detection applications such as the detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to analyze the characteristics of each cluster.
 In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent to populations.
 It helps in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographical location.

Topic -2 Types of data in cluster analysis


Types of Data Structures
First, let us look at the types of data structures that are widely used in cluster analysis.

We shall also see the types of data that often occur in cluster analysis and how to preprocess them for such analysis.
Suppose that a data set to be clustered contains n objects, which may represent persons,
houses, documents, countries, and so on.

Main memory-based clustering algorithms typically operate on either of the following two
data structures.
Types of data structures in cluster analysis are
 Data Matrix (or object by variable structure)
 Dissimilarity Matrix (or object by object structure)

Data Matrix
This represents n objects, such as persons, with p variables such as age, height, weight, gender, race, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects x p variables).

The data matrix is often called a two-mode matrix, since its rows and columns represent different entities.

Dissimilarity Matrix
It is often represented by an n-by-n table, where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix is symmetric with a zero diagonal.
This is also called a one-mode matrix, since its rows and columns represent the same entity.
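
A minimal sketch of both structures, assuming Euclidean distance as the dissimilarity measure (the five objects and two variables below are made up for illustration):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 5 objects (rows) by p = 2 variables (columns)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [8.0, 9.0],
              [9.0, 8.0],
              [1.5, 1.5]])

# Dissimilarity matrix: n-by-n, with d(i, j) = Euclidean distance between
# objects i and j; it is symmetric and has zeros on the diagonal
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))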

Types Of Data Used In Cluster Analysis Are:


 Interval-Scaled variables
 Binary variables
 Nominal, Ordinal, and Ratio variables
 Variables of mixed types
Interval-Scaled Variables
 Interval-scaled variables are continuous measurements of a roughly linear scale.

 Typical examples include weight and height, latitude and longitude coordinates (e.g.,
when clustering houses), and weather temperature.

 The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for
weight, may lead to a very different clustering structure.

 To help avoid dependence on the choice of measurement units, the data should be
standardized. Standardizing measurements attempts to give all variables an equal
weight.

 This is especially useful when given no prior knowledge of the data. However, in
some applications, users may intentionally want to give more weight to a certain set
of variables than to others.

 For example, when clustering basketball player candidates, we may prefer to give
more weight to the variable height.
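
A minimal sketch of the standardization described above, using the mean absolute deviation (one common choice); the height/weight measurements are made up:

import numpy as np

# n = 4 objects by p = 2 interval-scaled variables (height cm, weight kg)
X = np.array([[170.0, 65.0],
              [182.0, 80.0],
              [158.0, 52.0],
              [175.0, 74.0]])

m = X.mean(axis=0)               # mean m_f of each variable f
s = np.abs(X - m).mean(axis=0)   # mean absolute deviation s_f
Z = (X - m) / s                  # standardized measurement z_if
print(np.round(Z, 2))            # unit-free values with equal weight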

Binary Variables
 A binary variable is a variable that can take only 2 values.
 For example, a gender variable can generally take 2 values, male and female.
 Contingency table for binary data: let the binary values be 0 and 1. For two objects i and j, let q be the number of variables that equal 1 for both objects, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both:

                 object j
                   1    0
object i    1      q    r
            0      s    t

Subtypes of binary variables:
 Symmetric binary: both states are equally valuable and carry the same weight, so it does not matter which state is coded as 1. Ex: male or female.
 Asymmetric binary: the two states are not equally important, and by convention the rarer or more significant state (e.g., a positive Covid test) is coded as 1.

Simple matching coefficient (invariant, if the binary variable is symmetric):

d(i, j) = (r + s) / (q + r + s + t)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

d(i, j) = (r + s) / (q + r + s)
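
A minimal sketch computing both coefficients from the contingency-table counts q, r, s, and t; the two made-up vectors record the presence (1) or absence (0) of six symptoms:

def binary_dissimilarity(i, j, asymmetric=False):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))   # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))   # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))   # 0-1 mismatches
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))   # 0-0 matches
    if asymmetric:
        return (r + s) / (q + r + s)        # Jaccard: 0-0 matches ignored
    return (r + s) / (q + r + s + t)        # simple matching coefficient

patient_1 = [1, 0, 1, 0, 0, 0]
patient_2 = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(patient_1, patient_2, asymmetric=True))   # 1/3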


Categorical Variables
 The data can be divided into categories.
 A categorical variable is a generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.
 Categorical variables are divided into two types: nominal and ordinal.
 Nominal variables:
 Method 1: Simple matching
 The dissimilarity between two objects i and j can be computed based on simple matching:

d(i, j) = (p - m) / p

 m: let m be the number of matches (i.e., the number of variables for which i and j are in the same state).
 p: let p be the total number of variables.

 Method 2: Use a large number of binary variables
 Create a new binary variable for each of the M nominal states.
 Ordinal Variables
 An ordinal variable can be discrete or continuous.
 In an ordinal variable, order is important, e.g., rank.
 It can be treated like an interval-scaled variable:
 by replacing x_if by its rank, r_if in {1, ..., M_f};
 by mapping the range of each variable onto [0, 1], i.e., replacing the rank r_if of the i-th object in the f-th variable by

z_if = (r_if - 1) / (M_f - 1)
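
A minimal sketch of this rank-and-normalize treatment, with a made-up three-state ordinal variable (so M_f = 3):

grades = ['fair', 'good', 'excellent']                    # ordered states
rank = {state: k + 1 for k, state in enumerate(grades)}   # r_if in {1..M_f}

values = ['good', 'excellent', 'fair', 'good']    # made-up observations
M = len(grades)
z = [(rank[v] - 1) / (M - 1) for v in values]     # z_if = (r_if - 1)/(M_f - 1)
print(z)   # [0.5, 1.0, 0.0, 0.5], now usable as interval-scaled values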

 Variables Of Mixed Type


 A database may contain all six types of variables:
 symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled, collectively called mixed-type variables.
Topic -3 Categorization of Major Clustering Methods

Partitioning Methods:
The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the
problem specification concise, we can assume that the number of clusters is given as
background knowledge. This parameter is the starting point for partitioning methods.
Formally, given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.
k-Means: A Centroid-Based Technique
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck.
An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intra-cluster similarity and low inter-cluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster.
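
Concretely, in the standard k-means formulation the quality of a partitioning is measured by the within-cluster variation, i.e., the sum of squared errors between all objects p in each cluster Ci and that cluster's centroid ci:

E = sum over i = 1..k of sum over p in Ci of dist(p, ci)^2

where dist is the Euclidean distance. k-means seeks a partitioning that makes E as small as possible.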

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for the K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
 Step-1: Select the number K to decide the number of clusters.
 Step-2: Select K random points as centroids. (They can be points other than those from the input dataset.)
 Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
 Step-4: Calculate the variance and place a new centroid for each cluster.
 Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
 Step-6: If any reassignment occurred, go to step-4; else go to FINISH.
 Step-7: The model is ready.
Hence each cluster has data points with some commonalities and is far away from other clusters.
(Figure: the working of the K-means clustering algorithm.)
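
A minimal NumPy sketch of the steps above; the two-blob data, K = 2, and the stopping test are illustrative choices, and empty-cluster handling is omitted:

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: select K random points from the data as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3 and 5: assign each data point to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: place the new centroid of each cluster at its mean
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 6: if no centroid moved, no reassignment occurred, so finish
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)    # one center near (0, 0), the other near (5, 5)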
Hierarchical clustering Methods
Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters based on previously defined clusters. It works by grouping data into a tree of clusters.
Hierarchical clustering starts by treating each data point as an individual cluster.
The endpoint is a set of clusters, where each cluster is distinct from the other clusters, and the objects within each cluster are broadly similar to one another.
There are two types of hierarchical clustering
 Agglomerative Hierarchical Clustering
 Divisive Clustering

Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merging them until one cluster is left.
Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
Why hierarchical clustering?
K-means clustering has some challenges: it requires a predetermined number of clusters, and it always tries to create clusters of the same size. To overcome these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we do not need prior knowledge of the required number of clusters.

Agglomerative hierarchical clustering


 Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters.
 Agglomerative clustering is also known as AGNES (Agglomerative Nesting).
 In agglomerative clustering, each data point acts as an individual cluster, and at each step data objects are grouped in a bottom-up manner.
 Initially, each data object is in its own cluster.
 At each iteration, clusters are combined until one cluster is formed.
Step 1: Consider each data point as an individual cluster.

Step 2: Determine the similarity between each cluster and all the other clusters (compute the proximity matrix).

Step 3: Combine the most similar clusters.

Step 4: Recalculate the proximity matrix for each cluster.

Step 5: Repeat step 3 and step 4 until you get a single cluster.
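
A minimal sketch of these steps using SciPy's hierarchical clustering routines; the six points and the choice of single (minimum-distance) linkage are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.5], [1.2, 1.3], [5.2, 4.8]])

# linkage() repeats steps 3-4: it merges the two closest clusters and
# updates the proximity matrix until a single cluster remains
Z = linkage(X, method='single')

# Cut the resulting tree into two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)    # e.g., [1 1 2 2 1 2]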
Divisive Hierarchical Clustering
Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering.
In divisive hierarchical clustering, all the data points start out in one single cluster, and in every iteration the data points that are not similar are separated from their cluster.
The separated data points are treated as individual clusters.
Finally, we are left with N clusters.
Example:

We need to calculate the distances between (p1, [p3, p5]), (p2, [p3, p5]) and (p4, [p3, p5]).

The resulting matrix representation is given below.

Now, to combine or join clusters, we consider the least value in the matrix, i.e., 5, which corresponds to the objects p2 and p4.
Next, we need to calculate the distance between (p1, [p2, p4]).

The resulting matrix representation is given below.

Now we take the least value in the matrix, i.e., 9, to combine or join clusters, so we join p1 with [p2, p4].
The final visualization of the clusters is given below.
Advantages of Hierarchical clustering
 It is simple to implement and gives the best output in some cases.
 It is easy and results in a hierarchy, a structure that contains more information.
 It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
 It breaks large clusters.
 It is difficult to handle clusters of different sizes and convex shapes.
 It is sensitive to noise and outliers.
 Once a merge or split has been performed, it can never be undone or adjusted.
Density-Based Methods
 Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
 They have difficulty finding clusters of arbitrary shape, such as “S”-shaped and oval clusters. Given such data, they would likely inaccurately identify convex regions in which noise or outliers are included in the clusters.
 To find clusters of arbitrary shape, alternatively, we can model clusters as dense regions
in the data space, separated by sparse regions.
 This is the main strategy behind density-based clustering methods, which can discover clusters of nonspherical shape.
 We study density-based clustering through DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

In DBSCAN we basically require two parameters:

 Epsilon – the radius of the neighborhood around a data point.
 Minimum points (MinPts) – the minimum number of points required within that radius for a region to count as dense.
 A core point is a data point that satisfies the minimum-points condition. Suppose a is a data point and we form a circle of radius epsilon around it; if at least the minimum number of points (say, 4) fall inside that circle, then a is a core point.
 A border point is a data point that does not itself satisfy the minimum-points condition. Suppose b is a data point and the circle of radius epsilon around it contains only 2 points; we then check the nearby points, and if a core point lies within b's neighborhood, then b is called a border point.
 A noise point is a data point that has no relation to either a core point or a border point, i.e., it is neither of the two.
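
A minimal sketch with scikit-learn's DBSCAN, assuming epsilon = 0.5 and minimum points = 4; the dense blob and the scattered points are made-up data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0, 0.3, (30, 2))     # a dense region -> one cluster
sparse = rng.uniform(-4, 4, (5, 2))     # scattered points -> mostly noise
X = np.vstack([dense, sparse])

# eps is the epsilon radius; min_samples is the minimum-points threshold
db = DBSCAN(eps=0.5, min_samples=4).fit(X)

print(db.labels_)                 # label -1 marks noise points
print(db.core_sample_indices_)    # indices of the core points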
Grid-Based Methods
 The clustering methods discussed so far are data-driven: they partition the set of data objects and adapt to the distribution of the objects in the embedding space.
 Alternatively, a grid-based clustering method takes a space-driven approach by partitioning the embedding space into cells, independent of the distribution of the input objects.
 The grid-based clustering approach uses a multiresolution grid data structure.
 It quantizes the data space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
 The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells.

 Grid-based clustering is studied here using several interesting methods:
 STING: explores statistical information stored in the grid cells.
 CLIQUE: represents a grid- and density-based approach for subspace clustering in a
high-dimensional data space.
 STING (STatistical INformation Grid): a grid-based multiresolution clustering technique in which the embedding spatial area of the input objects is divided into rectangular cells. The space can be divided in a hierarchical and recursive way.
 Several levels of such rectangular cells correspond to different levels of resolution and
form a hierarchical structure:
 Each cell at a high level is partitioned to form a number of cells at the next lower
level.
 Statistical information regarding the attributes in each grid cell, such as the mean,
maximum, and minimum values, is precomputed and stored as statistical parameters.

The figure shows a hierarchical structure for STING clustering. The statistical parameters of
higher-level cells can easily be computed from the parameters of the lower-level cells.
These parameters include the following: the attribute-independent parameter, count; and the attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max (maximum), and the type of distribution that the attribute value in the cell follows, such as normal, uniform, exponential, or none.
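
As a rough illustration of such precomputation (not the STING algorithm itself), the sketch below builds one 4 x 4 grid level over made-up spatial data and stores count, mean, min, and max per cell:

import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 1, (200, 2))    # spatial objects in the unit square
values = rng.normal(50, 10, 200)        # one attribute value per object

n_cells = 4                             # a 4 x 4 level of rectangular cells
cell = (points * n_cells).astype(int)   # grid cell index of each object

stats = {}
for i in range(n_cells):
    for j in range(n_cells):
        v = values[(cell[:, 0] == i) & (cell[:, 1] == j)]
        if len(v):                      # store statistical parameters per cell
            stats[(i, j)] = {'count': len(v), 'mean': v.mean(),
                             'min': v.min(), 'max': v.max()}

print(stats[(0, 0)])    # higher-level cells could aggregate these values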

“How is this statistical information useful for query answering?” The statistical parameters
can be used in a top-down, grid-based manner as follows. First, a layer within the
hierarchical structure is determined from which the query-answering process is to start. This
layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated probability range) reflecting the cell’s
relevancy to the given query. The irrelevant cells are removed from further consideration.
Processing of the next lower level examines only the remaining relevant cells. This process
is repeated until the bottom layer is reached. At this time, if the query specification is met,
the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall
into the relevant cells are retrieved and further processed until they meet the query’s
requirements.

“What advantages does STING offer over other clustering methods?” STING offers several advantages:

 The grid-based computation is query-independent, because the statistical information stored in each cell represents summary information of the data in the grid cell, independent of the query.
 The grid structure facilitates parallel processing and incremental updating
 The method’s efficiency is a major advantage

Disadvantage:
 All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.
CLIQUE
For example, consider a health informatics application where patient records contain attributes describing personal information, numerous symptoms, conditions, and family history. For bird flu patients, for instance, the age, gender, and job attributes may vary dramatically within a wide range of values.
Thus, it can be difficult to find such a cluster within the entire data space. Instead, by searching in subspaces, we may find a cluster of similar patients in a lower-dimensional space (e.g., patients who are similar to one another with respect to symptoms like high fever and cough but no runny nose, and aged between 3 and 16).

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces.
CLIQUE partitions each dimension into nonoverlapping intervals, thereby partitioning the entire embedding space of the data objects into cells.
It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the number of objects mapped to it exceeds the density threshold.
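
A minimal sketch of that first CLIQUE step, assuming a made-up 10-interval partitioning per dimension and a density threshold of 5 objects per cell:

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.2, (40, 2)),     # one dense region
               rng.uniform(0, 10, (20, 2))])    # sparse background

n_intervals, threshold = 10, 5                  # data space is [0, 10) x [0, 10)
cells = Counter(map(tuple, (X / 10 * n_intervals).astype(int)))

# A cell is dense if the number of objects mapped to it exceeds the threshold
dense_cells = {c: n for c, n in cells.items() if n > threshold}
print(dense_cells)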
What Are Outliers?
Assume that a given statistical process is used to generate a set of data objects. An outlier
is a data object that deviates significantly from the rest of the objects, as if it were generated
by a different mechanism.

Outliers are different from noisy data. Noise is a random error or variance in a measured
variable. In general, noise is not interesting in data analysis, including outlier detection. For
example, in credit card fraud detection, a customer’s purchase behavior can be modeled as a
random variable. A customer may generate some “noise transactions” that may seem like
“random errors” or “variance,” such as by buying a bigger lunch one day, or having one
more cup of coffee than usual. Such transactions should not be treated as outliers; otherwise,
the credit card company would incur heavy costs from verifying that many transactions. The
company may also lose customers by bothering them with multiple false alarms. As in many
other data analysis and data mining tasks, noise should be removed before outlier detection.
Outliers are interesting because they are suspected of not being generated by the same mechanisms as the rest of the data. Therefore, in outlier detection, it is important to justify why the detected outliers are generated by some other mechanism.
 Outlier detection is also related to novelty detection in evolving data sets.
 For example, by monitoring a social media web site where new content is incoming,
novelty detection may identify new topics and trends in a timely manner.
 Novel topics may initially appear as outliers.
 To this extent, outlier detection and novelty detection share some similarity in
modeling and detection methods
 In general, outliers can be classified into three categories, namely global outliers,
contextual (or conditional) outliers, and collective outliers.
 Global Outliers
 In a given data set, a data object is a global outlier if it deviates significantly from
the rest of the data set.
 Global outliers are sometimes called point anomalies, and are the simplest type of
outliers.
 Most outlier detection methods are aimed at finding global outliers.
 Global outlier detection is important in many applications.
 Consider intrusion detection in computer networks, for example.
 If the communication behavior of a computer is very different from the normal
patterns (e.g., a large number of packages is broadcast in a short time), this behavior
may be considered as a global outlier and the corresponding computer is a suspected
victim of hacking.
 As another example, in trading transaction auditing systems, transactions that do not
follow the regulations are considered as global outliers and should be held for further
examination.
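
As a simple illustration (one common statistical approach, not the only one), the sketch below flags objects that lie more than three standard deviations from the mean as global outliers; the data and the threshold are made up:

import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(100, 10, 500), [190.0, 5.0])  # two injected outliers

z = (data - data.mean()) / data.std()   # standardize the measurements
outliers = data[np.abs(z) > 3]          # deviate significantly from the rest
print(outliers)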
 Contextual Outliers
 “The temperature today is 35 °C. Is it exceptional (i.e., an outlier)?” It depends, for example, on the time and location! If it is winter in Hyderabad, yes, it is an outlier.
 If it is a summer day in Hyderabad, then it is normal.
 Unlike global outlier detection, in this case, whether or not today’s temperature
value is an outlier depends on the context—the date, the location, and possibly some
other factors.
 Contextual outliers are a generalization of local outliers
 In credit card fraud detection, in addition to global outliers, an analyst may consider
outliers in different contexts.
 Consider customers who use more than 90% of their credit limit.
 If one such customer is viewed as belonging to a group of customers with low credit
limits, then such behavior may not be considered an outlier.
 However, similar behavior of customers from a high-income group may be
considered outliers if their balance often exceeds their credit limit.
 Such outliers may lead to business opportunities: raising credit limits for such customers can bring in new revenue.
 Collective Outliers
 Suppose you are a supply-chain manager of AllElectronics. You handle thousands of
orders and shipments every day. If the shipment of an order is delayed, it may not be
considered an outlier because, statistically, delays occur from time to time.
 However, you have to pay attention if 100 orders are delayed on a single day.
 Those 100 orders as a whole form an outlier, although each of them may not be
regarded as an outlier if considered individually.
 You may have to take a close look at those orders collectively to understand the
shipment problem.


 Collective outliers: in the figure, the black objects as a whole form a collective outlier, because the density of those objects is much higher than that of the rest of the data set.
 However, every black object individually is not an outlier with respect to the whole
data set.
 Collective outlier detection has many important applications.
 For example, in intrusion detection, a denial-of-service package from one computer
to another is considered normal, and not an outlier at all.
 However, if several computers keep sending denial-of-service packages to each
other, they as a whole should be considered as a collective outlier.

Summary: General Characteristics of the Major Clustering Methods
Partitioning methods
 Find mutually exclusive clusters of spherical shape
 Distance-based
 May use mean or medoid (etc.) to represent cluster center
 Effective for small- to medium-size data sets
Hierarchical methods
 Clustering is a hierarchical decomposition (i.e., multiple levels)
 Cannot correct erroneous merges or splits
 May incorporate other techniques like microclustering or consider object “linkages”
Density-based methods
 Can find arbitrarily shaped clusters
 Clusters are dense regions of objects in space that are separated by low-density regions
 Cluster density: each point must have a minimum number of points within its “neighborhood”
 May filter out outliers
Grid-based methods
 Use a multiresolution grid data structure
 Fast processing time (typically independent of the number of data objects, yet dependent on grid size)
