
Chapter 3

Unsupervised Learning

3.1 Clustering
Clustering is the task of dividing unlabeled data points into groups such that similar data points fall in the same cluster, while dissimilar points fall in different clusters. In simple words, the aim of the clustering
process is to segregate groups with similar traits and assign them into clusters.
Let’s understand this with an example. Suppose you are the head of a
rental store and wish to understand the preferences of your customers to scale
up your business. Is it possible for you to look at the details of each customer
and devise a unique business strategy for each one of them? Definitely not.
But, what you can do is cluster all of your customers into, say 10 groups
based on their purchasing habits and use a separate strategy for customers in
each of these 10 groups. And this is what we call clustering. Now that we
understand what clustering is, let's take a look at its different types.

3.2 Exploratory Data Analysis for Clustering


Exploratory data analysis, also known as EDA, is a method of analyzing
datasets to identify and summarize their key features. Using EDA, we can
easily understand the dataset, discover patterns, spot outliers, and investigate
the correlation between variables. Curiosity and understanding the context
of the data are essential considerations in this process, as they will help solve
some of the most fundamental issues. EDA also assists us in deciding which
features to consider when developing a machine learning model. These are the
goals of EDA:
1. Data Cleaning: EDA involves examining the data for errors, missing
values, and inconsistencies. It includes techniques such as imputation,
handling missing data, and identifying and removing outliers.

2. Descriptive Statistics: EDA uses descriptive statistics to understand the
central tendency, variability, and distribution of variables. Measures
such as the mean, median, mode, standard deviation, range, and percentiles
are commonly used.

3. Data Visualization: EDA employs visual techniques to represent the
data graphically. Visualizations such as histograms, box plots,
scatter plots, line plots, heatmaps, and bar charts help identify
patterns, trends, and relationships within the data.

4. Feature Engineering: EDA allows for the exploration of variables
and their transformations to create new features or derive meaningful
insights. Feature engineering can involve scaling, normalization, binning,
encoding categorical variables, and creating interaction or derived variables.

5. Correlation and Relationships: EDA helps discover relationships and
dependencies between variables. Techniques such as correlation analysis,
scatter plots, and cross-tabulations offer insights into the strength and
direction of relationships between variables.

6. Data Segmentation: EDA can involve dividing the data into
meaningful segments based on certain criteria or characteristics. This
segmentation provides insights into specific subgroups within
the data and can lead to more focused analysis.

7. Hypothesis Generation: EDA aids in generating hypotheses or research
questions based on the preliminary exploration of the data. It
helps form the foundation for further analysis and model
building.

8. Data Quality Assessment: EDA allows for assessing the quality and
reliability of the data. It involves checking for data integrity,
consistency, and accuracy to ensure the data is suitable
for analysis.
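
To make these goals concrete, here is a minimal Python sketch of how a few of them (data cleaning, descriptive statistics, correlation, and a simple outlier check) might be carried out with pandas. The file name customers.csv and the column annual_spend are assumptions used purely for illustration.

import pandas as pd

# Load a hypothetical customer dataset (the file name and columns are assumed).
df = pd.read_csv("customers.csv")

# Data cleaning: count missing values and impute numeric gaps with the median.
print(df.isna().sum())
df = df.fillna(df.median(numeric_only=True))

# Descriptive statistics: mean, standard deviation, quartiles, etc.
print(df.describe())

# Correlation and relationships: pairwise correlation between numeric variables.
print(df.corr(numeric_only=True))

# Outlier check: flag rows more than 3 standard deviations from the column mean.
z = (df["annual_spend"] - df["annual_spend"].mean()) / df["annual_spend"].std()
print(df[z.abs() > 3])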

3.2.1 Types of EDA


Depending on the number of variables analyzed at a time, we can divide EDA
into different types. EDA refers to the process of analyzing and examining
data sets to uncover patterns, identify relationships, and gain insights.
There are various types of EDA techniques that can be employed depending on
the nature of the data and the goals of the analysis. Here are some common
types of EDA:

• Univariate Analysis: This type of analysis focuses on analyzing individual
variables in the data set. It involves summarizing and visualizing a
single variable at a time to understand its distribution, central tendency,
spread, and other relevant statistics. Techniques like histograms, box plots,
bar charts, and summary statistics are commonly used in univariate analysis.

• Bivariate Analysis: Bivariate analysis involves exploring the relationship
between two variables. It helps find associations, correlations,
and dependencies between pairs of variables. Scatter plots, line plots,
correlation matrices, and cross-tabulation are commonly used techniques
in bivariate analysis.

• Multivariate Analysis: Multivariate analysis extends bivariate analysis
to more than two variables. It aims to understand the complex
interactions and dependencies among multiple variables in a data set.
Techniques such as heatmaps, parallel coordinates, factor analysis, and
principal component analysis (PCA) are used for multivariate analysis.

• Time Series Analysis: This type of analysis is mainly applied to data
sets that have a temporal component. Time series analysis involves
examining and modeling patterns, trends, and seasonality in the data
over time. Techniques like line plots, autocorrelation analysis, moving
averages, and ARIMA (AutoRegressive Integrated Moving Average) models
are commonly used in time series analysis.

• Missing Data Analysis: Missing data is a common issue in datasets,
and it may impact the reliability and validity of the analysis.
Missing data analysis involves identifying missing values, understanding
the patterns of missingness, and using suitable techniques to handle
them. Techniques such as missing data pattern analysis, imputation
strategies, and sensitivity analysis are employed in missing data analysis.

• Outlier Analysis: Outliers are data points that drastically deviate
from the general pattern of the data. Outlier analysis involves identifying
and understanding the presence of outliers, their potential causes, and their
impact on the analysis. Techniques such as box plots, scatter plots,
z-scores, and clustering algorithms are used for outlier analysis.

• Data Visualization: Data visualization is a critical aspect of EDA that
involves creating visual representations of the data to facilitate
understanding and exploration. Various visualization techniques, including
bar charts, histograms, scatter plots, line plots, heatmaps, and
interactive dashboards, are used to represent different kinds of data.

These are just a few examples of the types of EDA techniques that can be
employed during data analysis. The choice of techniques depends on the
characteristics of the data, the research questions, and the insights
sought from the analysis.
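
As a small illustration of the first three types, the sketch below runs a univariate, a bivariate, and a simple multivariate check on a tiny made-up table; the data values and column names are assumptions chosen only to keep the example self-contained.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset used only for illustration (income and spend in thousands).
df = pd.DataFrame({
    "age":    [23, 35, 31, 52, 46, 29, 41, 38],
    "income": [28, 55, 48, 90, 72, 39, 65, 58],
    "spend":  [12, 30, 26, 44, 40, 18, 35, 31],
})

# Univariate analysis: distribution of a single variable.
df["age"].plot(kind="hist", bins=5, title="Age distribution")
plt.show()

# Bivariate analysis: relationship between a pair of variables.
df.plot(kind="scatter", x="income", y="spend", title="Income vs. spend")
plt.show()

# Multivariate analysis: correlation matrix across all numeric variables.
print(df.corr())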

3.3 Types of Clustering in Machine Learning


Clustering broadly divides into two subgroups:

• Hard Clustering: Each input data point either fully belongs to a cluster
or not. For instance, in the example above, every customer is assigned
to one group out of the ten.

• Soft Clustering: Rather than assigning each input data point to a
distinct cluster, it assigns a probability or likelihood of the data point
being in those clusters. For example, in the given scenario, each customer
receives a probability of being in any of the ten retail store clusters.
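
The following is a minimal sketch of this distinction using scikit-learn: K-Means produces hard labels, while a Gaussian mixture model produces soft membership probabilities. The two-feature synthetic data stands in for the customer example and is an assumption made only for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic two-feature data standing in for customer records.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Hard clustering: each point receives exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels[:5])

# Soft clustering: each point receives a probability for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:5])   # each row sums to 1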

3.4 Different Types of Clustering Algorithms


Since the task of clustering is subjective, there are many ways to achieve it.
Every methodology follows a different set of rules for defining the 'similarity'
among data points. In fact, more than 100 clustering algorithms are known,
but only a few are widely used. Let's look at them in detail:

• Connectivity Models: As the name suggests, these models are based
on the notion that data points closer in data space exhibit more
similarity to each other than data points lying farther away. These
models can follow two approaches. In the first approach, they start
by classifying all data points into separate clusters and then aggregating
them as the distance decreases. In the second approach, all data points
are classified as a single cluster and then partitioned as the distance
increases. Also, the choice of distance function is subjective. These
models are very easy to interpret but lack scalability for handling big
datasets. Examples of these models are the hierarchical clustering
algorithms and their variants.
• Centroid Models: These clustering algorithms iterate, deriving similarity
from the proximity of a data point to the centroid or cluster center.
The k-Means clustering algorithm, a popular example, falls into this
category. These models necessitate specifying the number of clusters
beforehand, requiring prior knowledge of the dataset. They iteratively
run to discover local optima.
• Distribution Models: These clustering models are based on the notion
of how probable it is that all data points in the cluster belong to the
same distribution (For example: Normal, Gaussian). These models
often suffer from overfitting. A popular example of these models is the
Expectation-maximization algorithm which uses multivariate normal
distributions.
• Density Models: These models search the data space for regions of
varying density of data points. They isolate different
dense regions and assign the data points within these regions to the
same cluster. Popular examples of density models are DBSCAN and
OPTICS. These models are particularly useful for identifying clusters of
arbitrary shape and detecting outliers, as they can detect and separate
points that are located in sparse regions of the data space, as well as
points that belong to dense regions.

3.5 Hierarchical Clustering


Hierarchical clustering is a popular method for grouping objects. It creates
groups so that objects within a group are similar to each other and different
from objects in other groups. Clusters are visually represented in a hierarchical
tree called a dendrogram. Hierarchical clustering has a couple of key benefits:
• There is no need to pre-specify the number of clusters. Instead, the
dendrogram can be cut at the appropriate level to obtain the desired
number of clusters.
• Data is easily summarized/organized into a hierarchy using dendrograms.
Dendrograms make it easy to examine and interpret clusters.

Figure 3.1: Difference between types of Machine Learning

3.5.1 Applications of Hierarchical Clustering


There are many real-life applications of Hierarchical clustering. They include:

• Bioinformatics: grouping animals according to their biological features
to reconstruct phylogeny trees.

• Business: dividing customers into segments or forming a hierarchy of
employees based on salary.

• Image processing: grouping handwritten characters in text recognition
based on the similarity of the character shapes.

• Information Retrieval: categorizing search results based on the query.

3.5.2 Hierarchical clustering types


There are two main types of hierarchical clustering:

• Agglomerative: Initially, each object is considered to be its own cluster.
According to a particular procedure, the clusters are then merged step
by step until a single cluster remains. At the end of the cluster merging
process, a cluster containing all the elements will be formed.

• Divisive: The Divisive method is the opposite of the Agglomerative
method. Initially, all objects are considered as a single cluster. Then
the division process is performed step by step until each object forms
a separate cluster. The cluster division or splitting procedure is carried
out according to some principle, such as the maximum distance between
neighboring objects in the cluster.

Between Agglomerative and Divisive clustering, Agglomerative clustering
is generally the preferred method. The example below will focus on Agglomerative
clustering algorithms because they are the most popular and easiest
to implement.

Agglomerative Approach
This algorithm is also referred to as the bottom-up approach. It treats
each data point as a single cluster and then repeatedly merges the most
similar (closest) clusters until a single large cluster is obtained or
some stopping condition is satisfied.
Algorithm
1. Treat each of the N data points as an individual cluster.

2. Find the pair of clusters with the least (closest) distance and
combine them into a single cluster.

3. Recalculate the pair-wise distances between the newly formed cluster
and the previously existing clusters.

4. Repeat steps 2 and 3 until all N data samples are merged into a single
large cluster.
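
A minimal from-scratch sketch of these steps is shown below, using single-linkage distance and stopping once a target number of clusters remains; the toy data, the stopping rule, and the function name are assumptions used only for illustration.

import numpy as np

def naive_agglomerative(X, n_clusters):
    # Step 1: every data point starts as its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Pre-compute pair-wise distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        # Step 2: find the pair of clusters with the smallest single-linkage distance.
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Steps 3-4: merge the closest pair and repeat with the updated clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.1]])
print(naive_agglomerative(X, n_clusters=3))   # e.g. [[0, 1], [2, 3], [4]]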
Advantages
• Easy to identify nested clusters.

• Gives good results and is easy to implement.

• They are suitable for automation.

• Reduces the effect of initial values on the clustering results.

• Reduces the computing time and space complexity.

Disadvantages

• It can never undo what was done previously.

• Difficulty in handling clusters of different sizes and convex shapes
leads to an increase in time complexity.

• There is no direct minimization of an objective function.

• Sometimes there is difficulty in identifying the exact number of clusters
from the dendrogram.

Divisive Approach
This approach is also referred to as the top-down approach. Here, we consider
the entire data set as one cluster and iteratively split it into smaller
clusters. This is done until each object forms its own cluster or a
termination condition holds. This method is rigid: once a merging or
splitting is done, it can never be undone.
Algorithm

1. Start the process with one cluster containing all the samples.

2. Select the largest cluster, i.e., the cluster with the widest diameter.

3. Detect the data point in the cluster found in step 2 with the minimum
average similarity to the other elements in that cluster.

4. The data sample found in step 3 becomes the first element of the
fragment group.

5. Detect the element in the original group which has the highest average
similarity with the fragment group.

6. If the average similarity of the element obtained in step 5 with the
fragment group is greater than its average similarity with the original
group, then assign the data sample to the fragment group and go to step 5;
otherwise stop splitting this cluster.

7. Repeat steps 2 to 6 until each data point is separated into an
individual cluster.

Advantage

• It produces more accurate hierarchies than the bottom-up algorithm in
some circumstances.

Disadvantages

• The top-down approach is computationally more complex than the bottom-up
approach because we need a second, flat clustering algorithm.

• Use of different distance metrics for measuring the distance between clusters
may generate different results.

Divisive is the opposite of Agglomerative: it starts with all the points in
one cluster and divides them to create more clusters. These algorithms build
a distance matrix of all the existing clusters and merge or split clusters
according to the linkage criterion. The clustering of the data points is
represented using a dendrogram. There are different types of linkage:

• Single Linkage: In single linkage, the distance between two clusters
is the shortest distance between points in those two clusters.

• Complete Linkage: In complete linkage, the distance between two
clusters is the farthest distance between points in those two clusters.

• Average Linkage: In average linkage, the distance between two
clusters is the average distance of every point in one cluster to every
point in the other cluster.

3.5.3 Hierarchical clustering steps


Hierarchical clustering employs a measure of distance/similarity to create new
clusters. Steps for Agglomerative clustering can be summarized as follows:

1. Compute the proximity matrix using a particular distance metric

2. Each data point is assigned to a cluster

3. Merge the clusters based on a metric for the similarity between clusters

4. Update the distance matrix

5. Repeat Step 3 and Step 4 until only a single cluster remains
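
In practice these steps are usually delegated to a library. The sketch below is one possible illustration using SciPy (the toy data and the choice of average linkage are assumptions): linkage builds the merge hierarchy, its method argument can be set to 'single', 'complete', or 'average' to match the linkage types described earlier, and fcluster cuts the dendrogram to a chosen number of clusters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy two-dimensional data with two well-separated groups (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

# Steps 1-5: compute proximities and merge clusters until one cluster remains.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain a desired number of clusters (here 2).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Visualize the merge hierarchy as a dendrogram.
dendrogram(Z)
plt.show()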

3.6 Centroid-based clustering algorithms /
Partitioning clustering algorithms
In centroid/partitioning clustering, clusters are represented by a central vector,
which may not necessarily be a member of the dataset. Even in this particular
clustering type, the value of K needs to be chosen. This is an optimization
problem: finding the number of centroids or the value of K and assigning the
objects to nearby cluster centers. These steps need to be performed in such a
way that the squared distance of each object from its cluster center is minimized.

3.6.1 K-Means
One of the most widely used centroid-based clustering algorithms is K-Means,
and one of its drawbacks is that you need to choose a K value in advance.
K-Means clustering algorithm
The K-Means algorithm splits the given dataset into a predefined number (K)
of clusters using a particular distance metric. The center of each
cluster/group is called the centroid.
1. Choosing the number of clusters: The first step is to define the K
number of clusters in which we will group the data. Let’s select K=3.
2. Initializing centroids: The centroid is the center of a cluster, but initially
the exact centers of the data points are unknown, so we select random data
points and define them as the centroids for each cluster. We will initialize 3
centroids in the dataset.

3. Assign data points to the nearest cluster: Now that centroids are
initialized, the next step is to assign each data point Xn to its closest
cluster centroid Ck. In this step, we first calculate the distance between
data point X and each centroid C using the Euclidean distance metric, and
then assign the data point to the cluster whose centroid is closest.

4. Re-initialize centroids: Next, we will re-initialize the centroids by
calculating the average of all data points of that cluster.

5. Repeat steps 3 and 4: We keep repeating steps 3 and 4 until the centroids
stabilize and the assignments of data points to clusters no longer change.
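
A minimal NumPy sketch of these five steps (Lloyd's algorithm) is given below; the toy data, K = 3, and the convergence check are assumptions made for illustration, and in practice a library implementation such as scikit-learn's KMeans would normally be used.

import numpy as np

def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and initialize centroids as K random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: re-compute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data with three separated blobs (assumed for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 3, 6)])
labels, centroids = kmeans(X, k=3)
print(centroids)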

3.6.2 Advantages and Disadvantages
Advantages
The following are some advantages of the K-Means clustering algorithm:

1. It is very easy to understand and implement.

2. If we have a large number of variables, K-Means is faster than
Hierarchical clustering.

3. On re-computation of centroids, an instance can change its cluster.

4. Tighter clusters are formed with K-Means as compared to Hierarchical
clustering.

Disadvantages
The following are some disadvantages of the K-Means clustering algorithm:

1. It is a bit difficult to predict the number of clusters, i.e. the value of k.

2. Output is strongly impacted by initial inputs like the number of clusters
(value of k).

3. The order of the data can have a strong impact on the final output.

4. It is very sensitive to rescaling. If we rescale our data by normalization
or standardization, then the output will completely change.

5. It does not do a good clustering job if the clusters have a complicated
geometric shape.

3.6.3 Applications of K-Means Clustering Algorithm
The main goals of cluster analysis are:

• To get a meaningful intuition from the data we are working with.

• Cluster-then-predict, where different models will be built for different
subgroups.

To fulfill the above-mentioned goals, K-Means clustering performs
well enough. It can be used in the following applications:
1. Market segmentation

2. Document Clustering

3. Image segmentation

4. Image compression

5. Customer segmentation

6. Analyzing the trend on dynamic data

3.7 DBSCAN
DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications
with Noise. It is an unsupervised clustering algorithm. DBSCAN can work
with clusters of any size in huge amounts of data and can work with datasets
containing a significant amount of noise. It is based on the criterion of a
minimum number of points within a region. The DBSCAN algorithm can efficiently
cluster densely grouped points into one cluster and can identify local density
in the data points among large datasets. DBSCAN can very effectively handle
outliers. An advantage of DBSCAN over the K-Means algorithm is that the number
of clusters need not be known beforehand in the case of DBSCAN.
The DBSCAN algorithm depends upon two parameters: epsilon and minPoints.

• Epsilon is defined as the radius around each data point within which the
density is considered.

• minPoints is the minimum number of points required within that radius for
the data point to become a core point.

The circle can be extended to higher dimensions.

3.7.1 DBSCAN Algorithm
In the DBSCAN algorithm, a circle with radius epsilon is drawn around
each data point, and the data point is classified as a Core Point, Border Point,
or Noise Point. A data point is classified as a core point if it has at least
minPoints data points within its epsilon radius. If it has fewer points than
minPoints, it is known as a Border Point, and if there are no other points
inside its epsilon radius, it is considered a Noise Point.
Let us understand the working through an example.

In the above figure, we can see that point A has no points inside its epsilon (e)
radius, hence it is a Noise Point. Point B has minPoints (= 4) points within its
epsilon radius, thus it is a Core Point, while the remaining point has only 1
point (fewer than minPoints) within its radius, hence it is a Border Point. The
above figure shows us a cluster created by DBSCAN with minPoints = 3. Here, we
draw a circle of equal radius epsilon around every data point. These two
parameters help in creating spatial clusters.

All the data points with at least 3 points in the circle including itself are
considered as Core points represented by red color. All the data points with
less than 3 but greater than 1 point in the circle including itself are considered
as Border points. They are represented by yellow color. Finally, data points
with no point other than itself present inside the circle are considered as Noise
represented by the purple color. For locating data points in space, DBSCAN
uses Euclidean distance, although other methods can also be used (like great
circle distance for geographical data). It needs to scan through the entire
dataset only once, whereas other algorithms require multiple passes over the data.
Steps Involved in DBSCAN Algorithm
1. First, for each point, all the points within its epsilon radius are found,
and the core points are identified as those with a number of points greater
than or equal to minPoints.
2. Next, for each core point, if not assigned to a particular cluster, a new
cluster is created for it.
3. All the densely connected points related to the core point are found and
assigned to the same cluster. Two points are called densely connected
points if they have a neighbor point that has both the points within
epsilon distance.
4. Then all the points in the data are iterated, and the points that do not
belong to any cluster are marked as noise.
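
As a short sketch of these steps in practice, scikit-learn's DBSCAN exposes epsilon and minPoints as the eps and min_samples parameters; the synthetic data and the parameter values below are assumptions chosen only to illustrate how dense regions become clusters while sparse points are labelled as noise.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.3, (40, 2)),    # dense region 1
    rng.normal(5, 0.3, (40, 2)),    # dense region 2
    rng.uniform(-3, 8, (5, 2)),     # sparse, likely noise
])

# eps corresponds to epsilon and min_samples to minPoints in the text above.
db = DBSCAN(eps=0.6, min_samples=4).fit(X)

print(np.unique(db.labels_))              # noise points are labelled -1
print((db.labels_ == -1).sum(), "points flagged as noise")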

3.7.2 Advantages of the DBSCAN Algorithm


DBSCAN does not require the number of clusters to be known beforehand,
unlike the K-Means algorithm.
1. It can find clusters with any shape.
2. It can also locate clusters that are not connected to any other group or
clusters. It can work well with noisy clusters.
3. It is robust to outliers.

3.7.3 Disadvantages of the DBSCAN Algorithm


1. It does not work well with datasets that have varying densities.

2. It cannot be employed with multiprocessing as it cannot be partitioned.

3. It cannot find the right clusters if the dataset is sparse.

4. It is sensitive to the parameters epsilon and minPoints.

3.7.4 Applications
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a
popular clustering algorithm in data mining and machine learning, particularly
useful for tasks involving spatial data analysis. Here are some applications
where DBSCAN is commonly used:

1. Spatial Data Clustering: DBSCAN is widely used for clustering spatial
datasets, such as geographical data, GPS data, and image data. It
can automatically identify clusters of arbitrary shape and handle noise
effectively.

2. Anomaly Detection: DBSCAN can be used to detect outliers or anomalies
in datasets. Objects that do not belong to any cluster or are in
low-density regions can be considered anomalies.

3. Image Segmentation: In image processing, DBSCAN can be employed
for segmenting images based on similarity in pixel values. This helps
in tasks like object recognition, background subtraction, and image
compression.

4. Customer Segmentation: DBSCAN can segment customers based on
their purchasing behavior, geographical location, or any other relevant
features. This segmentation helps businesses in targeted marketing,
personalized recommendations, and understanding customer demographics.

5. Network Analysis: DBSCAN can be used to analyze networks, such
as social networks or transportation networks. It can identify clusters
of closely connected nodes, which is valuable for community detection,
identifying influential nodes, or finding network bottlenecks.

6. Genomics: DBSCAN is used in genomics for clustering gene expression
data to identify patterns or groups of genes with similar expression
profiles. This aids in understanding gene functions, identifying biomarkers,
and studying diseases.

7. Time Series Data Analysis: DBSCAN can be adapted to handle time
series data, where it can cluster similar temporal patterns. This is
useful in various domains like finance for identifying trading patterns,
in healthcare for monitoring patient data, or in environmental science
for analyzing climate data.

8. Fraud Detection: DBSCAN can detect unusual patterns in transaction
data, helping in fraud detection and prevention. It can identify clusters
of transactions that deviate significantly from normal behavior, indicating
potential fraudulent activities.

3.8 Some Extra Questions:


1. List out the difference between Hierarchical Clustering & DBSCAN.

2. What are the various distance Measures used in clustering?

3. Explain various evaluation parameters used in clustering.

4. List out the difference between agglomerative and divisive clustering.

5. What are the steps involved in EDA for clustering? Explain with an
example.

