
SCHOOL OF COMPUTING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Machine Learning Essentials – SCSA1415

UNIT 3 – UNSUPERVISED LEARNING

Clustering – K-Means Clustering – Supervised Learning after Clustering – Density-Based Clustering Methods – Hierarchical-Based Clustering Methods – Partitioning Methods – Grid-Based Methods. Dimensionality Reduction: Linear Discriminant Analysis – Principal Component Analysis.

Clustering

Clustering is the task of dividing a population of data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers in order to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.

Now that we understand what clustering is, let’s take a look at the types of clustering.

Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:

 Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all. For example, in the scenario above, each customer is put into exactly one of the 10 groups.
 Soft Clustering: In soft clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the scenario above, each customer is assigned a probability of belonging to each of the 10 clusters of the retail store.

Types of clustering algorithms

Since the task of clustering is subjective, there are many ways of achieving this goal. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few are used widely. Let’s look at them in detail:

 Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space are more similar to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregate them as the distance decreases. In the second approach, all data points are initially classified into a single cluster, which is then partitioned as the distance increases. The choice of distance function is also subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.

 Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of a cluster. The K-Means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.

 Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a Gaussian distribution). These models often suffer from overfitting. A popular example of these models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.

 Density Models: These models search the data space for regions of varied density of data points. They isolate the different density regions and assign the data points within a region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.

Now I will be taking you through two of the most popular clustering algorithms in detail – K Means
clustering and Hierarchical clustering. Let’s begin.

K Means Clustering
K-Means is an iterative clustering algorithm that converges to a local optimum of its objective (the within-cluster sum of squared distances). The algorithm works in the following steps:

1. Specify the desired number of clusters K: let us choose K = 2 for these 5 data points in 2-D space.

2. Randomly assign each data point to a cluster: let’s assign three points to cluster 1, shown in red, and two points to cluster 2, shown in grey.

3. Compute cluster centroids: the centroid of the data points in the red cluster is shown using a red cross and that of the grey cluster using a grey cross.

4. Re-assign each point to the closest cluster centroid: note that only the data point at the bottom is currently assigned to the red cluster even though it is closer to the centroid of the grey cluster; thus, we re-assign that data point to the grey cluster.

5. Re-compute cluster centroids: now we re-compute the centroids for both clusters.

6. Repeat steps 4 and 5 until no improvements are possible: we repeat the 4th and 5th steps until the cluster assignments stop changing. When there is no further switching of data points between the two clusters for two successive iterations, the algorithm has converged to a local optimum and terminates, unless a maximum number of iterations is specified explicitly. A minimal code sketch of this procedure is shown below.
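As a minimal sketch of this procedure (assuming scikit-learn and NumPy, with a small hypothetical 2-D dataset in place of the figures above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Five 2-D points (illustrative values; the original figures are not reproduced here)
X = np.array([[1.0, 1.0],
              [1.5, 2.0],
              [3.0, 4.0],
              [5.0, 7.0],
              [3.5, 5.0]])

# Step 1: choose K = 2; scikit-learn then initializes centroids, assigns points
# to the nearest centroid, and re-computes centroids (steps 2-5) until the
# assignments stop changing or max_iter is reached.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Centroids:\n", kmeans.cluster_centers_)
```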

Supervised Learning after Clustering


First, the N data points are assigned to N different clusters, with one data point in each cluster. Then, the two data points with the minimum Euclidean distance between them are merged into a single cluster. After that, the two clusters with the minimum centroid distance between them are merged into a single cluster, and so on.
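A common way to combine clustering with supervised learning, as the section title suggests, is to run clustering first and then feed its output to a supervised learner, for example by training one model per cluster or by using the cluster label as an extra feature. A minimal sketch of the latter idea, assuming scikit-learn and a small synthetic labelled dataset (all names here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 200 points, 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised step: cluster the training data (the labels y are not used here)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Supervised step: append each point's cluster label as an extra feature
# (one-hot encoding the label is a common refinement) and train a classifier
X_train_aug = np.hstack([X_train, kmeans.predict(X_train).reshape(-1, 1)])
X_test_aug = np.hstack([X_test, kmeans.predict(X_test).reshape(-1, 1)])

clf = LogisticRegression().fit(X_train_aug, y_train)
print("Test accuracy:", clf.score(X_test_aug, y_test))
```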

Density based clustering

Density-based clustering refers to a method that is based on a local cluster criterion, such as density-connected points. In this section, we will discuss density-based clustering with examples.

What is Density-based clustering?

Density-based clustering refers to one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Data points in the low-density regions separating two clusters are considered noise. The neighborhood within a radius ε of a given object is known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the object is called a core object.

Density-Based Clustering - Background

Two parameters control density-based clustering:

Eps: the maximum radius of the neighborhood.

MinPts: the minimum number of points required in the Eps-neighborhood of a point.

The Eps-neighborhood of a point i is defined as NEps(i) = { k ∈ D : dist(i, k) ≤ Eps }.

Directly density reachable:

A point i is directly density-reachable from a point k with respect to Eps and MinPts if i ∈ NEps(k) and k satisfies the core point condition below.

Core point condition:

|NEps(k)| ≥ MinPts

Density reachable:

A point i is density-reachable from a point j with respect to Eps and MinPts if there is a chain of points i1, …, in with i1 = j and in = i such that each point i(m+1) in the chain is directly density-reachable from i(m).

Density connected:

A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such that both i and j are density-reachable from o with respect to Eps and MinPts.

Working of Density-Based Clustering

Suppose a set of objects is denoted by D'. We can say that an object i is directly density-reachable from an object j only if i is located within the ε-neighborhood of j and j is a core object.

An object i is density-reachable from an object j with respect to ε and MinPts in a given set of objects D' only if there is a chain of objects i1, …, in with i1 = j and in = i such that each i(m+1) is directly density-reachable from i(m) with respect to ε and MinPts.

An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D' only if there is an object o in D' such that both i and j are density-reachable from o with respect to ε and MinPts.
Major Features of Density-Based Clustering

The primary features of Density-based clustering are given below.

o It is a one-scan method: it needs only a single pass over the data.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape.

Density-Based Clustering Methods

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a density-based notion of a cluster and can identify clusters of arbitrary shape in spatial databases that contain noise (outliers).
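As a minimal sketch (assuming scikit-learn and a small synthetic dataset), DBSCAN can be applied as follows; points labelled -1 are the noise points described above:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic, non-convex data: two interleaving half-circles with some noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps corresponds to the neighborhood radius (Eps), min_samples to MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated clusters:", n_clusters)
print("Noise points:", np.sum(labels == -1))
```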

OPTICS

OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the database with respect to its density-based clustering structure. The cluster ordering contains information equivalent to density-based clusterings obtained over a broad range of parameter settings. OPTICS is useful for both automatic and interactive cluster analysis, including finding an intrinsic clustering structure.
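scikit-learn also provides an OPTICS implementation; a minimal sketch (reusing the same kind of synthetic data as above) might look like this:

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# OPTICS orders points by reachability distance; clusters are then extracted
# (default "xi" method) without fixing a single Eps value in advance.
optics = OPTICS(min_samples=5).fit(X)

labels = optics.labels_              # -1 again marks noise points
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```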

DENCLUE

DENCLUE is a density-based clustering method by Hinneburg and Keim. It enables a compact mathematical description of arbitrarily shaped clusters in high-dimensional data, and it works well for data sets with a large amount of noise.

Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This algorithm starts with all the data points assigned to a cluster of their own. Then the two nearest clusters are merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left.

The results of hierarchical clustering can be shown using a dendrogram, which can be interpreted as follows.

Dendrogram

At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest clusters
are then merged till we have just one cluster at the top. The height in the dendrogram at which two
clusters are merged represents the distance between two clusters in the data space.

The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the above example, the best choice for the number of clusters is 4, as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

Dendrogram Distance

Two important things that you should know about hierarchical clustering are:

 This algorithm has been implemented above using a bottom-up (agglomerative) approach. It is also possible to follow a top-down (divisive) approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is assigned to a separate cluster.
 The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters (for example, different distance measures and linkage criteria such as single, complete, average, or Ward linkage); a code sketch using one of these follows below.
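As a minimal sketch (assuming SciPy, Matplotlib, and a small synthetic dataset), agglomerative clustering and its dendrogram can be produced as follows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D data drawn from three loose groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2))
               for c in ([0, 0], [3, 3], [0, 4])])

# Bottom-up merging; "ward" is one possible linkage (closeness) criterion
Z = linkage(X, method="ward")

# Plot the dendrogram: the height of each merge is the distance between the clusters
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()

# Cut the tree to obtain, for example, 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```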

Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

 Recommendation engines
 Market segmentation
 Social network analysis
 Search result grouping
 Medical imaging
 Image segmentation
 Anomaly detection

Partitioning methods
K-Means clustering is a partitioning method: as anticipated, this method decomposes a dataset into a set of disjoint clusters. Given a dataset of N points, a partitioning method constructs K (N ≥ K) partitions of the data, with each partition representing a cluster. The dataset is thus divided into K pre-defined groups. Each cluster center is chosen in such a way that the data points of a cluster are closer to their own cluster centroid than to any other cluster centroid.

Grid based methods.


The grid-based clustering methods use a multi-resolution grid data structure. They quantize the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. The benefit of the method is its fast processing time, which is generally independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.
Instances of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform approach; and CLIQUE, which defines a grid- and density-based approach for clustering in high-dimensional data space.
STING is a grid-based multiresolution clustering method in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to multiple levels of resolution, and these cells form a hierarchical structure: each cell at a higher level is partitioned to form several cells at the next lower level. Statistical information regarding the attributes in each grid cell (including the mean, maximum, and minimum values) is precomputed and stored.
Statistical parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells. These parameters include the attribute-independent parameter count and the attribute-dependent parameters mean, stdev (standard deviation), min (minimum), and max (maximum), as well as the type of distribution that the attribute values in the cell follow, such as normal, uniform, exponential, or none (if the distribution is unknown). When the records are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are computed directly from the records. The value of distribution may be assigned by the user if the distribution type is known beforehand, or obtained by hypothesis tests such as the χ2 test.
The distribution type of a higher-level cell can be computed based on the majority of the distribution types of its corresponding lower-level cells, in conjunction with a threshold filtering procedure. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to none.
The statistical parameters can be used in a top-down, grid-based query-answering approach as follows. First, a layer within the hierarchical structure is chosen from which the query-answering process is to start. This layer generally contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell’s relevancy to the given query.
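STING itself is not provided by common Python libraries, but the underlying idea of quantizing the space into grid cells and precomputing per-cell statistics can be sketched with NumPy. This is a simplified, single-level illustration under those assumptions, not the full hierarchical algorithm:

```python
import numpy as np

# Hypothetical 2-D data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Quantize the space into a 10 x 10 grid (a single resolution level only)
n_cells = 10
edges = [np.linspace(X[:, d].min(), X[:, d].max(), n_cells + 1) for d in range(2)]
cell_idx = np.stack(
    [np.clip(np.digitize(X[:, d], edges[d]) - 1, 0, n_cells - 1) for d in range(2)],
    axis=1)

# Precompute per-cell statistics (count and mean), the kind of summary STING stores
counts = np.zeros((n_cells, n_cells), dtype=int)
sums = np.zeros((n_cells, n_cells, 2))
for (i, j), x in zip(cell_idx, X):
    counts[i, j] += 1
    sums[i, j] += x

means = np.divide(sums, counts[..., None],
                  out=np.zeros_like(sums), where=counts[..., None] > 0)

dense_cell = np.unravel_index(counts.argmax(), counts.shape)
print("Densest cell:", dense_cell, "count:", counts.max(), "mean:", means[dense_cell])
```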
Dimensionality Reduction

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
Working in high-dimensional spaces can be undesirable for many reasons; raw data are
often sparse as a consequence of the curse of dimensionality, and analyzing the data is
usually computationally intractable (hard to control or deal with). Dimensionality reduction is
common in fields that deal with large numbers of observations and/or large numbers of
variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.
Methods are commonly divided into linear and nonlinear approaches. Approaches can also be
divided into feature selection and feature extraction. Dimensionality reduction can be used
for noise reduction, data visualization, cluster analysis, or as an intermediate step to facilitate
other analyses.

Linear Discriminant Analysis

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.

LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also
attempt to express one dependent variable as a linear combination of other features or
measurements.[1][2] However, ANOVA uses categorical independent variables and
a continuous dependent variable, whereas discriminant analysis has continuous independent
variables and a categorical dependent variable (i.e. the class label).[3] Logistic
regression and probit regression are more similar to LDA than ANOVA is, as they also explain
a categorical variable by the values of continuous independent variables. These other methods
are preferable in applications where it is not reasonable to assume that the independent
variables are normally distributed, which is a fundamental assumption of the LDA method.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that
they both look for linear combinations of variables which best explain the data. [4] LDA
explicitly attempts to model the difference between the classes of data. PCA, in contrast, does
not take into account any difference in class, and factor analysis builds the feature
combinations based on differences rather than similarities. Discriminant analysis is also
different from factor analysis in that it is not an interdependence technique: a distinction
between independent variables and dependent variables (also called criterion variables) must
be made.

LDA works when the measurements made on independent variables for each observation are
continuous quantities. When dealing with categorical independent variables, the equivalent
technique is discriminant correspondence analysis. [5][6]

Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each
case must have a score on one or more quantitative predictor measures, and a score on a group
measure.[7] In simple terms, discriminant function analysis is classification - the act of
distributing things into groups, classes or categories of the same type.
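As a minimal sketch (assuming scikit-learn and its built-in Iris dataset), LDA can be used both as a classifier and to reduce the data to fewer dimensions before later classification:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)             # 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA as a linear classifier
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("Test accuracy:", lda.score(X_test, y_test))

# LDA for dimensionality reduction: at most (n_classes - 1) = 2 components
lda2 = LinearDiscriminantAnalysis(n_components=2)
X_train_2d = lda2.fit_transform(X_train, y_train)
print("Reduced shape:", X_train_2d.shape)     # (n_samples, 2)
```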

Principal Component Analysis

Principal component analysis (PCA) is a popular technique for analyzing large datasets
containing a high number of dimensions/features per observation, increasing the
interpretability of data while preserving the maximum amount of information, and enabling the
visualization of multidimensional data. Formally, PCA is a statistical technique for reducing
the dimensionality of a dataset. This is accomplished by linearly transforming the data into a
new coordinate system where (most of) the variation in the data can be described with fewer
dimensions than the initial data. Many studies use the first two principal components in order
to plot the data in two dimensions and to visually identify clusters of closely related data
points. Principal component analysis has applications in many fields such as population
genetics, microbiome studies, and atmospheric science.

The principal components of a collection of points in a real coordinate space are a sequence
of p unit vectors, where the ith vector is the direction of a line that best fits the data while
being orthogonal to the first i-1 vectors. Here, a best-fitting line is defined as one that
minimizes the average squared perpendicular distance from the points to the line. These
directions constitute an orthonormal basis in which different individual dimensions of the data
are linearly uncorrelated. Principal component analysis is the process of computing the
principal components and using them to perform a change of basis on the data, sometimes
using only the first few principal components and ignoring the rest.

In data analysis, the first principal component of a set of p variables, presumed to be jointly
normally distributed, is the derived variable formed as a linear combination of the original
variables that explains the most variance. The second principal component explains the most
variance in what is left once the effect of the first component is removed, and we may proceed
through p iterations until all the variance is explained. PCA is most commonly used when
many of the variables are highly correlated with each other and it is desirable to reduce their
number to an independent set.

PCA is used in exploratory data analysis and for making predictive models. It is commonly
used for dimensionality reduction by projecting each data point onto only the first few
principal components to obtain lower-dimensional data while preserving as much of the data's
variation as possible. The first principal component can equivalently be defined as a direction
that maximizes the variance of the projected data. The i-th principal component can be taken as
a direction orthogonal to the first i-1 principal components that maximizes the variance of the
projected data.

For either objective, it can be shown that the principal components are eigenvectors of the
data's covariance matrix. Thus, the principal components are often computed by
eigendecomposition of the data covariance matrix or singular value decomposition of the data
matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely
related to factor analysis. Factor analysis typically incorporates more domain specific
assumptions about the underlying structure and solves eigenvectors of a slightly different
matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate
systems that optimally describe the cross-covariance between two datasets while PCA defines
a new orthogonal coordinate system that optimally describes variance in a single dataset.
Robust and L1-norm-based variants of standard PCA have also been proposed.

PCA of a multivariate Gaussian distribution centered at (1, 3), with a standard deviation of 3 in roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.

On its own, PCA is used across a variety of use cases:

1. Visualize multidimensional data. Data visualizations are a great tool for communicating multidimensional data as 2- or 3-dimensional plots.
2. Compress information. Principal Component Analysis is used to compress information to store and transmit data more efficiently. For example, it can be used to compress images without losing too much quality, or in signal processing. The technique has successfully been applied across a wide range of compression problems in pattern recognition (specifically face recognition), image recognition, and more.
3. Simplify complex business decisions. PCA has been employed to simplify traditionally complex business decisions. For example, traders use over 300 financial instruments to manage portfolios. The algorithm has proven successful in the risk management of interest rate derivative portfolios, lowering the number of financial instruments from more than 300 to just 3-4 principal components.
4. Clarify convoluted scientific processes. The algorithm has been applied extensively in the understanding of convoluted and multidirectional factors that increase the probability of neural ensembles triggering action potentials.

When PCA is used as part of preprocessing, the algorithm is applied to:

1. Reduce the number of dimensions in the training dataset.
2. De-noise the data. Because PCA is computed by finding the components which explain the greatest amount of variance, it captures the signal in the data and omits the noise.

Let's take a look at how Principal Component Analysis is computed.

How is PCA calculated?

There are multiple ways to calculate PCA:

1. Eigendecomposition of the covariance matrix
2. Singular value decomposition of the data matrix
3. Eigenvalue approximation via power iterative computation
4. Non-linear iterative partial least squares (NIPALS) computation
5. … and more.

Let’s take a closer look at the first method - eigendecomposition of
the covariance matrix - to gain a deeper appreciation of PCA. There
are several steps in computing PCA:

1. Feature standardization. We standardize each feature to have a mean of 0 and a variance of 1. As we explain later in assumptions and limitations, features with values that are on different orders of magnitude prevent PCA from computing the best principal components.
2. Obtain the covariance matrix. The covariance matrix is a square matrix of d x d dimensions, where d stands for “dimension” (or feature or column, if our data is tabular). It contains the pairwise covariances between the features (which, for standardized features, equal their correlations).
3. Calculate the eigendecomposition of the covariance matrix. We calculate the eigenvectors (unit vectors) and their associated eigenvalues (the scalars by which the eigenvectors are scaled) of the covariance matrix.
4. Sort the eigenvectors from the highest eigenvalue to the lowest. The eigenvector with the highest eigenvalue is the first principal component. Higher eigenvalues correspond to greater amounts of shared variance explained.
5. Select the number of principal components. Select the top N eigenvectors (based on their eigenvalues) to become the N principal components. The optimal number of principal components is both subjective and problem-dependent. Usually, we look at the cumulative amount of shared variance explained by the combination of principal components and pick the number of components that still explains a significant share of the variance. A code sketch of these steps is given below.
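As a minimal sketch of these five steps (assuming NumPy and a small synthetic data matrix), eigendecomposition-based PCA might look like this:

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)),
               2 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

# 1. Standardize each feature to mean 0, variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (d x d)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is appropriate for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvectors by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Keep the top N components and project the data onto them
N = 2
explained = eigenvalues[:N].sum() / eigenvalues.sum()
X_reduced = X_std @ eigenvectors[:, :N]
print(f"Variance explained by {N} components: {explained:.2%}")
print("Reduced data shape:", X_reduced.shape)
```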

Keep in mind that the majority of data scientists will not calculate PCA by hand, but rather implement it in Python with scikit-learn, or use R to compute it. These mathematical foundations enrich our understanding of PCA but are not necessary for its implementation. Understanding PCA allows us to have a better idea of its advantages and disadvantages. A short scikit-learn example follows.
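As a minimal sketch (assuming scikit-learn and a small synthetic dataset), the same computation takes only a few lines:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features, two of them correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)

# Standardize first (PCA is sensitive to feature scale), then reduce to 2 components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)
```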

What are the advantages and disadvantages of PCA?

PCA offers multiple benefits, but it also suffers from certain shortcomings.

Advantages of PCA:

1. Easy to compute. PCA is based on linear algebra, which is computationally easy to solve by computers.
2. Speeds up other machine learning algorithms. Machine learning algorithms converge faster when trained on principal components instead of the original dataset.
3. Counteracts the issues of high-dimensional data. High-dimensional data causes regression-based algorithms to overfit easily. By using PCA beforehand to lower the dimensions of the training dataset, we prevent the predictive algorithms from overfitting.

Disadvantages of PCA:

1. Low interpretability of principal components. Principal components are linear combinations of the features from the original data, but they are not as easy to interpret. For example, it is difficult to tell which are the most important features in the dataset after computing principal components.
2. The trade-off between information loss and dimensionality reduction. Although dimensionality reduction is useful, it comes at a cost. Information loss is a necessary part of PCA. Balancing the trade-off between dimensionality reduction and information loss is unfortunately a necessary compromise that we have to make when using PCA.


What are the assumptions and limitations of PCA?

PCA is related to the set of operations in the Pearson correlation, so it inherits similar assumptions and limitations:

1. PCA assumes a correlation between features. If the features (or dimensions or columns, in tabular data) are not correlated, PCA will be unable to determine principal components.
2. PCA is sensitive to the scale of the features. Imagine we have two features: one takes values between 0 and 1000, while the other takes values between 0 and 1. PCA will be extremely biased towards the first feature being the first principal component, regardless of the actual maximum variance within the data. This is why it’s so important to standardize the values first.
3. PCA is not robust against outliers. Similar to the point above, the
algorithm will be biased in datasets with strong outliers. This is why
it is recommended to remove outliers before performing PCA.
4. PCA assumes a linear relationship between features. The algorithm
is not well suited to capturing non-linear relationships. That’s why
it’s advised to turn non-linear features or relationships between
features into linear, using the standard methods such as log
transforms.
5. Technical implementations often assume no missing values. When computing PCA using statistical software tools, they often assume that the feature set has no missing values (no empty rows). Be sure to remove rows and/or columns with missing values, or impute missing values with a close approximation (e.g. the mean of the column).
