
UNIT-3 UNSUPERVISED LEARNING AND REINFORCEMENT LEARNING

Unsupervised Machine Learning

In the previous topic, we learned about supervised machine learning, in which models are
trained using labeled data. But there may be many cases in which we do not have
labeled data and need to find the hidden patterns in the given dataset. To solve such
cases in machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in
which models are not supervised using a training dataset. Instead, the model itself
finds the hidden patterns and insights in the given data. It can be compared to the
learning which takes place in the human brain while learning new things. It can be
defined as:

Unsupervised learning is a type of machine learning in which models are trained
using an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification
problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the
underlying structure of the dataset, group the data according to similarities, and
represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset
containing images of different types of cats and dogs. The algorithm is never trained
on the given dataset, which means it does not have any idea about the features of
the dataset. The task of the unsupervised learning algorithm is to identify the image
features on its own. The unsupervised learning algorithm will perform this task by
clustering the image dataset into groups according to the similarities between
images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised
Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is very similar to how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which
makes unsupervised learning all the more important.

o In the real world, we do not always have input data with corresponding output,
so to solve such cases, we need unsupervised learning.

Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. This unlabeled input data is fed to the
machine learning model in order to train it. Firstly, the model interprets the raw data
to find the hidden patterns in the data, and then applies a suitable algorithm such
as k-means clustering, hierarchical clustering, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into
groups according to the similarities and differences between the objects.
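The workflow above can be sketched in a few lines; here we use scikit-learn's k-means (a library choice assumed for illustration, not prescribed by these notes) to feed unlabeled 2-D points to the model and let it find the groups on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled input data: two obvious groups, but no labels are given.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# The model interprets the raw data and divides it into groups
# according to similarities, without any supervision.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
```

Each point receives a cluster label; points in the same dense group end up with the same label.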

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of
problems:

o Clustering: Clustering is a method of grouping objects into clusters such
that objects with the most similarities remain in one group and have few or no
similarities with the objects of another group. Cluster analysis finds the
commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is
used for finding relationships between variables in a large database. It
determines the set of items that occur together in the dataset. Association
rules make marketing strategy more effective; for example, people who buy item X
(say, bread) also tend to purchase item Y (butter or jam). A typical
example of an association rule is Market Basket Analysis.
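The bread-and-butter rule above is usually quantified with support and confidence; the sketch below computes both over a small set of hypothetical market baskets:

```python
# Support and confidence for the rule {bread} -> {butter},
# computed over a toy, hypothetical set of transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n             # 3/4
support_rule = sum({"bread", "butter"} <= t for t in transactions) / n  # 2/4
confidence = support_rule / support_bread  # P(butter | bread) = 2/3
```

A rule is considered interesting when both its support and its confidence exceed chosen thresholds; the Apriori algorithm listed below searches for such itemsets efficiently.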

Unsupervised Learning algorithms:

Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)

o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to


supervised learning because, in unsupervised learning, we don't have labeled
input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning,
as it does not have corresponding output labels.
o The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

Clustering in Machine Learning

Clustering or cluster analysis is a machine learning technique, which groups the


unlabelled dataset. It can be defined as "A way of grouping the data points into
different clusters, consisting of similar data points. The objects with the possible
similarities remain in a group that has less or no similarities with another group."

It does this by finding similar patterns in the unlabelled dataset, such as shape,
size, color, behavior, etc., and dividing the data as per the presence and absence of
those similar patterns. It is an unsupervised learning method; hence, no supervision is
provided to the algorithm, and it deals with an unlabeled dataset.

After applying this clustering technique, each cluster or group is given a
cluster-ID. The ML system can use this ID to simplify the processing of large and
complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhat similar to classification, but the difference is the
type of dataset that we are using. In classification, we work with a labeled
dataset, whereas in clustering, we work with an unlabeled dataset.

Example: Let's understand the clustering technique with the real-world example of
a shopping mall: when we visit any shopping mall, we can observe that things with similar
usage are grouped together. The t-shirts are grouped in one section and
trousers in another; similarly, in the vegetable section, apples, bananas,
mangoes, etc., are grouped separately so that we can easily find
things. The clustering technique works in the same way. Another example of
clustering is grouping documents according to topic.

The clustering technique can be widely used in various tasks. Some most common
uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis

S. MANICKAM AP/CSE GRACE COLLEGE OF ENGINEERING, TUTICORIN



o Image segmentation

o Anomaly detection, etc.

Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations as per the user's past searches of
products. Netflix also uses this technique to recommend movies and web series
to its users as per their watch history.

The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.

Types of Clustering

1. Exclusive Clustering

2. Overlapping Clustering

3. Hierarchical Clustering

Exclusive Clustering: Exclusive clustering is hard clustering, in which each data
point belongs exclusively to one cluster. For example, K-Means clustering.

Here you can see all similar data points are clustered: all the blue-colored data
points are clustered into the blue cluster and all the red-colored data points are
clustered into the red cluster.

Overlapping Clustering: Overlapping clustering is soft clustering, in which a data
point can belong to multiple clusters. For example, C-Means clustering.

In this, we can see that some of the blue data points and some of the pink data
points overlap.

Hierarchical Clustering: Hierarchical clustering groups similar objects into
clusters. This forms a set of clusters in which each cluster is distinct from the
others, and the objects within each cluster are similar to each other.

Observe the figure. There are 6 different data points, namely A, B, C, D, E, and F.

 In case 1, A and B are clustered based on some similarities, whereas
E and D are clustered based on some similarities.
 In case 2, the combination of A and B is similar to C, so the
combination of A and B is grouped with C.
 In case 3, the combination of D and E is similar to F, so the
combination of D and E is grouped with F.
 In the last case, the combination of A, B, C and the combination of D,
E, F are quite similar, so all these points are grouped into a single cluster.

This is how hierarchical clustering works.

Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (a data point belongs
to only one group) and Soft clustering (data points can also belong to other groups).
But various other clustering approaches exist as well. Below are the
main clustering methods used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K defines
the number of pre-defined groups. The cluster centers are created in such a way that
the distance between the data points of one cluster and their centroid is minimal
compared to the other cluster centroids.

Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped distributions are formed as long as the dense regions can
be connected. It does this by identifying different clusters in the dataset
and connecting the areas of high density into clusters. The dense areas in data space
are separated from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.

Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the
probability of how likely a data point belongs to a particular distribution. The grouping is
done by assuming some distributions, commonly the Gaussian distribution.

The example of this type is the Expectation-Maximization Clustering


algorithm that uses Gaussian Mixture Models (GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitioning clustering, as
there is no requirement to pre-specify the number of clusters to be created. In this
technique, the dataset is divided into clusters to create a tree-like structure, which is
also called a dendrogram. Any number of clusters can be
selected by cutting the tree at the correct level. The most common example of this
method is the Agglomerative Hierarchical algorithm.

Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to
more than one group or cluster. Each data point has a set of membership coefficients,
which depend on its degree of membership in each cluster. The Fuzzy C-means
algorithm is the example of this type of clustering; it is sometimes also known as
the Fuzzy k-means algorithm.

Clustering Algorithms

The clustering algorithms can be divided based on the models explained
above. There are many types of clustering algorithms published, but only a few
are commonly used. The choice of clustering algorithm depends on the kind of data we
are using: some algorithms need a guess for the number of clusters in the given
dataset, whereas some need to find the minimum distance between the
observations of the dataset.

Here we are discussing the most popular clustering algorithms that are widely used
in machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variance. The number of clusters must be specified
in this algorithm. It is fast, requiring fewer computations, with linear
complexity O(n).

2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in


the smooth density of data points. It is an example of a centroid-based model,
that works on updating the candidates for centroid to be the center of the points
within a given region.

3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of


Applications with Noise. It is an example of a density-based model similar
to the mean-shift, but with some remarkable advantages. In this algorithm, the
areas of high density are separated by the areas of low density. Because of
this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be
used as an alternative to the k-means algorithm, or for those cases where
k-means can fail. In GMM, it is assumed that the data points are Gaussian
distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical
algorithm performs bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and clusters are then successively
merged. The cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: It is different from other clustering algorithms as it
does not require specifying the number of clusters. In this, data points
send messages between pairs of points until convergence. It has
O(N²T) time complexity, which is the main drawback of this algorithm.
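To make the DBSCAN entry concrete, here is a minimal scikit-learn run (the eps and min_samples values are illustrative) showing that areas of high density become clusters while points in low-density regions are marked as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense regions separated by sparse space, plus one isolated outlier.
X = np.array([[0.0, 0.0], [0.0, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.0, 5.2], [5.2, 5.0],
              [20.0, 20.0]])

# eps is the neighborhood radius; min_samples the density threshold.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
# DBSCAN labels points in low-density regions as -1 (noise).
```

Because membership depends only on density connectivity, no cluster count is specified and clusters may take any arbitrary shape.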

Applications of Clustering

Below are some commonly known applications of clustering technique in Machine


Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used
for the identification of cancerous cells. They divide the cancerous and
non-cancerous data sets into different groups.

o In Search Engines: Search engines also work on the clustering technique.


The search result appears based on the closest object to the search query. It
does it by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of the
clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used in identifying areas of
similar land use in the GIS database. This can be very useful for
determining the purpose for which a particular area of land is most suitable.

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning or data science. In this topic, we will learn
what the K-means clustering algorithm is and how it works.

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process: if K=2, there will be two clusters;
for K=3, there will be three clusters; and so on. It is an iterative algorithm that
divides the unlabeled dataset into k different clusters in such a way that each data
point belongs to only one group with similar properties.

It allows us to cluster the data into different groups and is a convenient way to
discover the categories of groups in an unlabeled dataset on its own, without the need
for any training.

It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of this algorithm is to minimize the sum of distances
between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value
of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative


process.
o Assigns each data point to its closest k-center. The data points which are
near a particular k-center form a cluster.

Hence each cluster has data points with some commonalities, and it is away from
the other clusters. The below diagram explains the working of the K-means Clustering
Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (These can be points other than those
from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new
closest centroid of each cluster.

Step-6: If any reassignment occurred, then go to step-4; else go to FINISH.

Step-7: The model is ready.
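The steps above can be sketched directly in numpy; this is an illustrative implementation (it assumes no cluster ever becomes empty, which holds for well-separated data, and the function name is ours):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: select K random points from the dataset as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3 / Step-5: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: place the new centroid of each cluster at its mean.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step-6: if no centroid moved, no reassignment will occur,
        # so the model is ready (Step-7).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, k=2)` on a 2-D dataset returns a cluster label per point and the final centroids.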

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

o Let's take the number of clusters, i.e., K=2, to identify the dataset and to put
the data points into different clusters. It means here we will try to group the
dataset into two different clusters.

o We need to choose K random points or centroids to form the clusters.
These points can be either points from the dataset or any other points. So,
here we are selecting the below two points as K points, which are not part
of our dataset.
o Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute this by calculating the distance between two points.
So, we will draw a median between both centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are near the K1 or
blue centroid, and points to the right of the line are close to the yellow centroid. Let's
color them blue and yellow for clear visualization.

o As we need to find the closest cluster, we will repeat the process by
choosing new centroids. To choose the new centroids, we will compute the
center of gravity of the points in each cluster, and will find the new centroids as below:
o Next, we will reassign each data point to the new centroid. For this, we
will repeat the same process of finding a median line. The median will be
like the below image:

From the above image, we can see that one yellow point is on the left side of the line,
and two blue points are to the right of the line. So, these three points will be assigned to
the new centroids.

As reassignment has taken place, we will again go to step-4, which is finding
new centroids or K-points. We will repeat the process by finding the center of gravity
of the points in each cluster, so the new centroids will be as shown in the below image:

o As we got the new centroids, we will again draw the median line and reassign
the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below
image:

As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends upon the
efficiency of the clusters that it forms. But choosing the optimal number of clusters is a
big task. There are different ways to find the optimal number of clusters,
but here we are discussing the most appropriate method to find the number of
clusters or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each
data point and its centroid within cluster 1, and the same holds for the other two terms.

To measure the distance between data points and centroid, we can use any method
such as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values
(ranging from 1 to 10).

o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered
the best value of K.

Since the graph shows a sharp bend that looks like an elbow, this approach is known
as the elbow method. The graph for the elbow method looks like the below image:

Note: We can choose the number of clusters equal to the number of data points. In that
case, the value of WCSS becomes zero, and that will be the endpoint of the plot.
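The elbow procedure can be run with scikit-learn, whose inertia_ attribute is exactly the WCSS (the three-blob dataset below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with 3 true clusters.
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this K

# WCSS keeps decreasing as K grows; the sharp bend (elbow) appears at
# K = 3, after which further clusters give only small improvements.
```

Plotting `wcss` against K produces the elbow curve described above; the K at the bend is chosen.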

Hierarchical Clustering in Machine Learning

Hierarchical clustering is another unsupervised machine learning algorithm, which
is used to group unlabeled datasets into clusters; it is also known as hierarchical
cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work. Unlike the K-means algorithm, there is no
requirement to predetermine the number of clusters.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the


algorithm starts with taking all data points as single clusters and merging them
until one cluster is left.
2. Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as
it is a top-down approach.

Why hierarchical clustering?

As we already have other clustering algorithms such as K-Means clustering, why do
we need hierarchical clustering? As we have seen, K-means clustering
has some challenges: it requires a predetermined
number of clusters, and it always tries to create clusters of the same size. To
solve these two challenges, we can opt for the hierarchical clustering algorithm
because, in this algorithm, we don't need to know the number of clusters in advance.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of HCA.
To group the data points into clusters, it follows the bottom-up approach. It means
this algorithm considers each data point as a single cluster at the beginning, and then
starts combining the closest pairs of clusters. It does this until all the clusters
are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How Does Agglomerative Hierarchical Clustering Work?

The working of the AHC algorithm can be explained using the below steps:

o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
o Step-4: Repeat Step-3 until only one cluster is left. So, we will get the
following clusters. Consider the below images:

o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
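Steps 1–5 are what scikit-learn's AgglomerativeClustering performs internally; here the hierarchy is cut at 3 clusters (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three well-separated pairs of points (N = 6 singleton clusters to start).
X = np.array([[1.0, 1.0], [1.1, 1.0],
              [5.0, 5.0], [5.1, 5.0],
              [9.0, 1.0], [9.1, 1.0]])

# Merging proceeds bottom-up; n_clusters=3 cuts the resulting
# dendrogram at the level that leaves three groups.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```

The nearby pairs merge first, so each pair ends up with its own label.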

Measure for the distance between two clusters

As we have seen, the distance between two clusters is crucial for
hierarchical clustering. There are various ways to calculate the distance between two
clusters, and these ways decide the rule for clustering. These measures are
called Linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: It is the shortest distance between the closest points of the
two clusters. Consider the below image:
2. Complete Linkage: It is the farthest distance between two points of two
different clusters. It is one of the popular linkage methods, as it forms tighter
clusters than single linkage.

3. Average Linkage: It is the linkage method in which the distance between
each pair of points (one from each cluster) is added up and then divided by the
total number of pairs to calculate the average distance between two clusters.
It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between
the centroids of the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the type
of problem or business requirement.
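The four linkage rules can be compared with SciPy (a library choice assumed here); for the toy dataset below every method produces the same three groups, though on noisier data they can differ:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.0, 1.0],
              [10.0, 0.0], [10.0, 1.0],
              [5.0, 20.0]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)      # build the merge hierarchy
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    # Each nearby pair stays together; the far-away point
    # becomes its own cluster under every linkage rule here.
```

On real data, single linkage tends to produce long chained clusters while complete linkage yields tighter, more compact ones, which is why the choice matters.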

Working of Dendrogram in Hierarchical clustering

The dendrogram is a tree-like structure that records each step the HC algorithm
performs. In the dendrogram plot, the Y-axis shows
the Euclidean distances between the data points, and the X-axis shows all the data
points of the given dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.

o As we have discussed above, firstly the data points P2 and P3 combine
together and form a cluster; correspondingly, a dendrogram is created, which
connects P2 and P3 with a rectangular shape. The height is decided according
to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram
is created. It is higher than the previous one, as the Euclidean distance between P5
and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one
dendrogram, and P4, P5, and P6 in another dendrogram.
o At last, the final dendrogram is created that combines all the data points
together.

We can cut the dendrogram tree structure at any level as per our requirement.
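SciPy can build the same structure; the third column of its linkage matrix holds the merge heights shown on a dendrogram's y-axis (no_plot=True keeps the example text-only, and the data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.0, 1.0],
              [4.0, 0.0], [4.0, 1.0],
              [10.0, 10.0]])

Z = linkage(X, method="single")
# Each row of Z records one merge: [cluster_a, cluster_b, height, new_size].
# Merge heights are non-decreasing, which is why cutting the dendrogram
# at a chosen height yields any desired number of clusters.
info = dendrogram(Z, no_plot=True)   # structure only; plotting omitted
heights = Z[:, 2]
```

The two nearby pairs merge first at height 1, then the pairs merge with each other, and the far-away point joins last at the greatest height.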

Cluster Validity

For cluster analysis, the analogous question is: how do we evaluate the "goodness" of
the resulting clusters?

Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing


whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g.,
to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information (use only the data).
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.

Why do we need cluster validity indices?


 To compare clustering algorithms.
 To compare two sets of clusters.
 To compare two clusters, i.e., which one is better in terms of compactness
and connectedness.
 To determine whether the structure found in the data is merely random, i.e., due to noise.
Generally, cluster validity measures are categorized into 3 classes.

1. Internal cluster validation: The clustering result is evaluated based on
the clustered data itself (internal information), without reference to
external information.
2. External cluster validation: Clustering results are evaluated based on
some externally known result, such as externally provided class labels.
3. Relative cluster validation: The clustering results are evaluated by
varying different parameters for the same algorithm (e.g., changing the
number of clusters).
Besides the term cluster validity index, we need to know about the inter-cluster
distance d(a, b) between two clusters a and b, and the intra-cluster index D(a) of a cluster a.
Inter-cluster distance d(a, b) between two clusters a and b can be –

 Single linkage distance: Closest distance between two objects belonging


to a and b respectively.
 Complete linkage distance: Distance between two most remote objects
belonging to a and b respectively.
 Average linkage distance: Average distance between all the objects
belonging to a and b respectively.
 Centroid linkage distance: Distance between the centroid of the two
clusters a and b respectively.
Intra-cluster distance D(a) of a cluster a can be –
 Complete diameter linkage distance: Distance between two farthest
objects belonging to cluster a.
 Average diameter linkage distance: Average distance between all the
objects belonging to cluster a.
 Centroid diameter linkage distance: Twice the average distance
between all the objects and the centroid of the cluster a.
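These linkage distances are simple to compute directly; the two small clusters below are hypothetical:

```python
import numpy as np

a = np.array([[0.0, 0.0], [0.0, 2.0]])   # cluster a
b = np.array([[5.0, 0.0], [7.0, 0.0]])   # cluster b

# All pairwise distances between objects of a and objects of b.
cross = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
single   = cross.min()    # closest objects of a and b
complete = cross.max()    # two most remote objects
average  = cross.mean()   # average over all cross pairs
centroid = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

# Intra-cluster: complete diameter = distance between the two
# farthest objects within a single cluster.
within = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
complete_diameter = within.max()
```

For this data: single = 5, complete = √53, centroid = √37, and the complete diameter of cluster a is 2.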

Recall: evaluating K-means clusters

The most common measure is the Sum of Squared Error (SSE).
For each point, the error is the distance to the nearest cluster center.
To get SSE, we square these errors and sum them:

SSE = ∑i ∑x in Ci dist(x, mi)²

where x is a data point in cluster Ci and mi is the representative point for cluster Ci.

 It can be shown that mi corresponds to the center (mean) of the cluster.
Given two sets of clusters, we prefer the one with the smallest error.
One easy way to reduce SSE is to increase K, the number of clusters.
 A good clustering with smaller K can have a lower SSE than a poor clustering
with higher K.
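SSE can be computed in a few lines; the example below (data is illustrative) confirms that adding clusters lowers it:

```python
import numpy as np

def sse(X, labels, centers):
    # Sum of squared distances of each point to its cluster center.
    return sum(((X[labels == i] - c) ** 2).sum()
               for i, c in enumerate(centers))

X = np.array([[0.0], [1.0], [10.0], [11.0]])

# K = 1: a single center at the overall mean.
sse1 = sse(X, np.array([0, 0, 0, 0]), [X.mean(axis=0)])
# K = 2: one center per natural group.
sse2 = sse(X, np.array([0, 0, 1, 1]),
           [X[:2].mean(axis=0), X[2:].mean(axis=0)])
# Increasing K reduces SSE (here from 101.0 down to 1.0).
```

This is why raw SSE alone cannot select K: it always improves as K grows, which motivates the elbow method above.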
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
 Example: SSE
Cluster Separation: measures how distinct or well-separated a cluster is from other
clusters.
 Example: squared error, overall
Cohesion is measured by the within-cluster sum of squares:

WSS = ∑i ∑x in Ci (x − mi)²

Separation is measured by the between-cluster sum of squares:

BSS = ∑i |Ci| (m − mi)²

where |Ci| is the size of cluster i, mi is its mean, and m is the mean of the means.
Note that BSS + WSS = constant

A proximity graph-based approach can also be used for cohesion and separation.
 Cluster cohesion is the sum of the weight of all links within a cluster.
 Cluster separation is the sum of the weights between nodes in the cluster
and nodes outside the cluster.
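The identity BSS + WSS = constant can be verified numerically; the toy 1-D clusters below are made up for illustration (with equal-sized clusters, the overall mean coincides with the mean of the cluster means):

```python
def mean(xs):
    return sum(xs) / len(xs)

# Toy 1-D data split into two equal-sized clusters (values chosen for illustration).
clusters = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]
all_points = [x for c in clusters for x in c]
m = mean(all_points)  # overall mean (= mean of the cluster means here)

# Within-cluster sum of squares (cohesion).
wss = sum((x - mean(c)) ** 2 for c in clusters for x in c)
# Between-cluster sum of squares (separation).
bss = sum(len(c) * (m - mean(c)) ** 2 for c in clusters)
# Total sum of squares: the constant that WSS + BSS always equals.
tss = sum((x - m) ** 2 for x in all_points)

print(wss, bss, tss)  # wss + bss equals tss
```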

Now, let’s discuss two internal cluster validity indices, namely the Dunn index and the DB
index.
Dunn index:
The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for
evaluating clustering algorithms. It is an internal evaluation scheme, where the
result is based on the clustered data itself. Like all other such indices, the aim of
the Dunn index is to identify sets of clusters that are compact, with a small
variance between members of a cluster, and well separated, where the means of
different clusters are sufficiently far apart compared with the within-cluster
variance.

The higher the Dunn index value, the better the clustering. The number of
clusters that maximizes the Dunn index is taken as the optimal number of clusters k. It
also has some drawbacks: as the number of clusters and the dimensionality of the data
increase, the computational cost also increases. The Dunn index for c clusters is
defined as:

DI = ( min over all pairs i ≠ j of d(i, j) ) / ( max over all k of D(k) )

where d(i, j) is the inter-cluster distance between clusters i and j, and D(k) is the
intra-cluster distance of cluster k.
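A minimal sketch of the Dunn index, assuming single-linkage inter-cluster distance and complete-diameter intra-cluster distance (any of the other linkage choices defined earlier would work equally well); the point data is made up:

```python
from itertools import combinations

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def dunn_index(clusters):
    """Dunn index with single-linkage d(i, j) and complete-diameter D(k)."""
    # Smallest closest-pair distance across any two clusters.
    inter = min(min(euclid(p, q) for p in a for q in b)
                for a, b in combinations(clusters, 2))
    # Largest farthest-pair distance within any one cluster.
    intra = max(max(euclid(p, q) for p, q in combinations(c, 2))
                for c in clusters)
    return inter / intra

clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 1)], [(10, 0), (10, 1)]]
print(dunn_index(clusters))  # compact, well-separated clusters -> large value
```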
DB index:
The Davies–Bouldin index (DBI) (introduced by David L. Davies and
Donald W. Bouldin in 1979), a metric for evaluating clustering algorithms, is an
internal evaluation scheme, where the validation of how well the clustering has
been done is made using quantities and features inherent to the dataset.

The lower the DB index value, the better the clustering. It also has a drawback:
a good value reported by this method does not imply the best information
retrieval.
The DB index for k clusters is defined as:

DB = (1/k) Σ_i max_{j ≠ i} (σ_i + σ_j) / d(c_i, c_j)

where σ_i is the average distance of the points in cluster i to its centroid c_i,
and d(c_i, c_j) is the distance between the centroids of clusters i and j.
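A minimal sketch of the DB index under the definition above, using the average distance to the centroid as the scatter σ and the centroid distance as d; the point data is made up:

```python
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(c):
    return tuple(sum(x) / len(x) for x in zip(*c))

def db_index(clusters):
    """Davies-Bouldin index: average over clusters of the worst
    (scatter_i + scatter_j) / centroid_distance ratio."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(euclid(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / euclid(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(db_index(clusters))  # small value: tight clusters far apart
```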
Many interesting algorithms are applied to analyze very large datasets, but most
algorithms don’t provide any means for their validation and evaluation. So it is very
difficult to conclude which clusters are best and should be taken for analysis.

There are several indices for predicting optimal clusters –

1. Silhouette Index
2. Dunn Index
3. DB Index
4. CS Index
5. I- Index
6. XB or Xie Beni Index
Silhouette Index –

Silhouette analysis refers to a method of interpretation and validation of
consistency within clusters of data.

The silhouette value is a measure of how similar an object is to its own
cluster (cohesion) compared to other clusters (separation). It can be used to study
the separation distance between the resulting clusters.

The silhouette plot displays a measure of how close each point in one cluster
is to points in the neighboring clusters and thus provides a way to assess parameters
like the number of clusters visually.

How Does Silhouette Analysis Work?

The silhouette validation technique calculates the silhouette index for each sample,
the average silhouette index for each cluster, and the overall average silhouette index
for a dataset. Using this approach, each cluster can be represented by a silhouette
index based on a comparison of its tightness and separation.

Calculation of Silhouette Value –

If the silhouette index value is high, the object is well matched to its own
cluster and poorly matched to neighbouring clusters. The silhouette coefficient is
calculated using the mean intra-cluster distance a(i) and the mean nearest-cluster
distance b(i) for each sample, and is defined as:

S(i) = ( b(i) − a(i) ) / max{ a(i), b(i) }

Where,
 a(i) is the average dissimilarity of ith object to all other objects in the same
cluster
 b(i) is the average dissimilarity of ith object with all objects in the closest
cluster.
Range of Silhouette Value –
Now, S(i) will lie in the range [-1, 1]:
1. If the silhouette value is close to 1, the sample is well-clustered and has been
assigned to an appropriate cluster.
2. If the silhouette value is close to 0, the sample could equally be assigned to
the cluster nearest to it; it lies equally far from both clusters, which
indicates overlapping clusters.
3. If the silhouette value is close to –1, the sample is misclassified and is merely
placed somewhere in between the clusters.
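The silhouette computation above can be sketched in plain Python; the points and cluster labels below are made up for illustration:

```python
def euclid(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def silhouette(i, points, labels):
    """S(i) = (b(i) - a(i)) / max(a(i), b(i)) for the i-th sample."""
    # a(i): mean distance to the other members of the same cluster.
    same = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    a = sum(euclid(points[i], points[j]) for j in same) / len(same)
    # b(i): smallest mean distance to the members of any other cluster.
    b = min(
        sum(euclid(points[i], points[j])
            for j in range(len(points)) if labels[j] == lab) / labels.count(lab)
        for lab in set(labels) if lab != labels[i]
    )
    return (b - a) / max(a, b)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print(silhouette(0, points, labels))  # close to 1: well-clustered
```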

EXTERNAL INDEX
There are different metrics used to evaluate the performance of a clustering
model or clustering quality.
 Purity
 Normalized mutual information (NMI)
 Rand index
Purity
Purity is quite simple to calculate. We assign a label to each cluster based on the most
frequent class in it. The purity is then the number of correctly matched class
and cluster labels divided by the total number of data points. Consider a case where
our clustering model groups the data points into 3 clusters as seen below:

Each cluster is assigned with the most frequent class label. We sum the number of
correct class labels in each cluster and divide it by the total number of data points.

In general, purity increases as the number of clusters increases. For instance, if we
have a model that groups each observation into a separate cluster, the purity becomes
one.
For this very reason, purity cannot be used as a trade-off between the number of
clusters and clustering quality.
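Purity can be sketched in a few lines; the class and cluster labels below are hypothetical, not the exact example from the text:

```python
from collections import Counter

def purity(true_labels, cluster_labels):
    """Assign each cluster its most frequent class; purity = correct / total."""
    correct = 0
    for c in set(cluster_labels):
        members = [t for t, k in zip(true_labels, cluster_labels) if k == c]
        # Count of the majority class within this cluster.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)

# Hypothetical labels for illustration.
true_labels    = ['A', 'A', 'A', 'B', 'B', 'C']
cluster_labels = [ 0,   0,   1,   1,   1,   2 ]
print(purity(true_labels, cluster_labels))
```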
Normalized mutual information (NMI)
NMI is related to information theory. We first need to understand what entropy is.

Entropy is a measure that quantifies uncertainty:

H = − Σ_i p_i log(p_i)

where p_i is the probability of label i, P(i). Let’s calculate the entropy of the class labels
in the previous example.
We can calculate the probability of a class label by dividing the number of data points
belonging to that class by the total number of data points. For instance, the probability of
class A is 6 / 18.
The entropy in our case is calculated as below; if you run the calculation, you will
see that the result is 1.089.
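The entropy above can be reproduced in a few lines of Python. The class counts 7, 6, and 5 (out of 18) are an assumption chosen to be consistent with the stated result of 1.089, with class A contributing 6 of the 18 points; natural logarithm is assumed:

```python
import math

def entropy(counts):
    """H = -sum(p_i * ln(p_i)) over the class probabilities p_i."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

counts = [7, 6, 5]  # assumed class-label counts; class A has 6 of 18 points
print(round(entropy(counts), 3))  # -> 1.089
```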

Rand index

We can now introduce the formula for the Rand index:

RI = (a + b) / C(n, 2)

 a is the number of times a pair of elements is in the same cluster for both the
actual and the predicted clustering, which we calculate as 2.
 b is the number of times a pair of elements is in different clusters for both
the actual and the predicted clustering, which we calculate as 8.
 The denominator C(n, 2) is the total number of pairs of elements, which is 15
(for n = 6 data points).
Thus, the Rand index in this case is 10 / 15 ≈ 0.67.
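The Rand index calculation can be sketched as follows; the two labelings of 6 points are hypothetical, chosen so that they reproduce the counts a = 2 and b = 8 from the text:

```python
from itertools import combinations

def rand_index(actual, predicted):
    """RI = (a + b) / C(n, 2): pairs grouped the same way in both clusterings."""
    a = b = 0
    for i, j in combinations(range(len(actual)), 2):
        same_actual = actual[i] == actual[j]
        same_pred = predicted[i] == predicted[j]
        if same_actual and same_pred:
            a += 1          # pair together in both clusterings
        elif not same_actual and not same_pred:
            b += 1          # pair apart in both clusterings
    n_pairs = len(actual) * (len(actual) - 1) // 2
    return (a + b) / n_pairs

# Hypothetical labelings of 6 points giving a = 2, b = 8.
actual    = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
print(rand_index(actual, predicted))  # (2 + 8) / 15
```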

Introduction to Dimensionality Reduction Technique

What is Dimensionality Reduction?

The number of input features, variables, or columns present in a given dataset is
known as dimensionality, and the process of reducing these features is called
dimensionality reduction.
In various cases a dataset contains a huge number of input features, which makes
the predictive modeling task more complicated. Because it is very difficult to
visualize or make predictions for a training dataset with a high number of features,
dimensionality reduction techniques are required for such cases.

A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it
provides similar information." These techniques are widely used in machine
learning to obtain a better-fitting predictive model when solving classification
and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such
as speech recognition, signal processing, bioinformatics, etc. It can also be used
for data visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality

Handling high-dimensional data is very difficult in practice, a problem commonly
known as the curse of dimensionality. If the dimensionality of the input dataset
increases, any machine learning algorithm and model becomes more complex. As
the number of features increases, the number of samples required also increases
proportionally, and the chance of overfitting increases. If a machine learning
model is trained on high-dimensional data, it becomes overfitted and results in
poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given dataset


are given below:

o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation and training time is required with reduced feature
dimensions.
o Reduced dimensions of features of the dataset help in visualizing the data
quickly.
o It removes the redundant features (if present) by taking care of
multicollinearity.

Disadvantages of Dimensionality Reduction


There are also some disadvantages of applying the dimensionality reduction, which
are given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal
components to retain is sometimes unknown.

The importance of dimensionality reduction

 A lower number of dimensions in the data means less training time, fewer
computational resources, and better overall performance of machine
learning algorithms.

 Dimensionality reduction helps avoid overfitting.

 Dimensionality reduction is extremely useful for data visualization.

 Dimensionality reduction takes care of multicollinearity.

 Dimensionality reduction is very useful for factor analysis: a useful
approach for finding latent variables, which are not directly measured
in a single variable but rather inferred from other variables in the dataset.
These latent variables are called factors.

 Dimensionality reduction removes noise in the data.

 Dimensionality reduction can be used for image compression.

 Dimensionality reduction can be used to transform non-linear data
into a linearly-separable form.
Approaches to Dimensionality Reduction

There are two ways to apply the dimension reduction technique, which are given
below:

Feature Selection

Feature selection is the process of selecting a subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high
accuracy. In other words, it is a way of selecting the optimal features from the
input dataset.

Three methods are used for the feature selection:

1. Filter Methods

In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of the filter method are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrapper Methods

The wrapper method has the same goal as the filter method, but it uses a machine
learning model for its evaluation. In this method, some features are fed to the ML
model and its performance is evaluated. The performance decides whether to add
or remove those features to increase the accuracy of the model. This method is more
accurate than the filter method but more complex to work with. Some common
techniques of wrapper methods are:
o Forward Selection
o Backward Selection

o Bi-directional Elimination

3. Embedded Methods

Embedded methods check the different training iterations of the machine learning
model and evaluate the importance of each feature. Some common techniques of
embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.

Feature Extraction:

Feature extraction is the process of transforming a space containing many
dimensions into a space with fewer dimensions. This approach is useful when we want
to keep the whole information while using fewer resources to process it. Some
common feature extraction techniques are:

a) Principal Component Analysis


b) Linear Discriminant Analysis
c) Kernel PCA

d) Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction

a. Principal Component Analysis

b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis

j. Auto-Encoder

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical process that converts the observations


of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis
and predictive modeling.

PCA works by considering the variance of each attribute, because an attribute with
high variance indicates a good split between classes, and hence it reduces the
dimensionality. Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing the power allocation in various
communication channels.

Backward Feature Elimination

The backward feature elimination technique is mainly used while developing a Linear
Regression or Logistic Regression model. The following steps are performed in this
technique for dimensionality reduction or feature selection:
o In this technique, firstly, all the n variables of the given dataset are taken to
train the model.
o The performance of the model is checked.

o Now we will remove one feature each time and train the model on n-1 features
for n times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the
performance of the model, and then we will drop that variable or features;
after that, we will be left with n-1 features.

o Repeat the complete process until no feature can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum
tolerable error rate, we can define the optimal number of features required for the
machine learning algorithm.

Forward Feature Selection

Forward feature selection follows the inverse process of the backward elimination
process. It means, in this technique, we don't eliminate the feature; instead, we will
find the best features that can produce the highest increase in the performance of the
model. Below steps are performed in this technique:

o We start with a single feature only, and progressively we will add each feature
at a time.
o Here we will train the model on each feature separately.

o The feature with the best performance is selected.


o The process is repeated until adding a feature no longer produces a
significant increase in the performance of the model.
Missing Value Ratio

If a dataset has too many missing values, then we drop those variables as they do not
carry much useful information. To perform this, we can set a threshold level, and if
a variable has missing values more than that threshold, we will drop that variable.
The higher the threshold value, the more efficient the reduction.

Low Variance Filter

Similar to the missing value ratio technique, data columns with little change in their
data carry less information. Therefore, we calculate the variance of each
variable, and all data columns with variance lower than a given threshold are
dropped, because low-variance features will not affect the target variable.

High Correlation Filter

High correlation refers to the case when two variables carry approximately the same
information. Because of this, the performance of the model can be degraded. The
correlation between independent numerical variables is measured by the
correlation coefficient; if this value is higher than a threshold, we can
remove one of the variables from the dataset, keeping the variable that shows
the higher correlation with the target variable.
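A minimal sketch of the low variance and high correlation filters in plain Python; the column names, values, and thresholds below are made up for illustration:

```python
def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / ((sum((x - ma) ** 2 for x in a) *
                   sum((y - mb) ** 2 for y in b)) ** 0.5)

# Toy dataset: column name -> values.
data = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [5.0, 5.0, 5.0, 5.0],   # near-zero variance -> dropped
    "f3": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with f1 -> dropped
}

# Low variance filter: keep columns whose variance exceeds a small threshold.
kept = [c for c in data if variance(data[c]) > 1e-8]

# High correlation filter: drop a column highly correlated with an earlier kept one.
selected = []
for c in kept:
    if all(abs(correlation(data[c], data[s])) < 0.95 for s in selected):
        selected.append(c)
print(selected)
```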

Random Forest

Random Forest is a popular and very useful feature selection algorithm in machine
learning. The algorithm provides an in-built measure of feature importance, so we do
not need to program it separately. In this technique, we generate a large set
of trees against the target variable, and use the usage statistics of each
attribute to find the most informative subset of features.
The random forest algorithm takes only numerical variables, so we need to convert the
input data into numeric data using one-hot encoding.

Factor Analysis

Factor analysis is a technique in which each variable is kept within a group according
to the correlation with other variables, it means variables within a group can have a
high correlation between themselves, but they have a low correlation with variables
of other groups.

We can understand it with an example: suppose we have two variables, Income and
Spend. These two variables have a high correlation, which means people with high
income spend more, and vice versa. Such variables are put into a group, and that
group is known as a factor. The number of these factors will be small compared
with the original dimensionality of the dataset.

Auto-encoders

One of the popular methods of dimensionality reduction is the auto-encoder, a
type of artificial neural network (ANN) whose main aim is to copy its inputs to
its outputs. The input is compressed into a latent-space representation, and the
output is reconstructed from this representation. It has two main parts:

o Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the latent-
space representation.
Principal Component Analysis

Principal Component Analysis is an unsupervised learning algorithm that is used for


the dimensionality reduction in machine learning. It is a statistical process that
converts the observations of correlated features into a set of linearly uncorrelated
features with the help of orthogonal transformation. These new transformed features
are called the Principal Components. It is one of the popular tools that is used for
exploratory data analysis and predictive modeling. It is a technique to draw strong
patterns from the given dataset by reducing the variances.

PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.

PCA works by considering the variance of each attribute, because an attribute with
high variance indicates a good split between classes, and hence it reduces the
dimensionality. Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing the power allocation in various
communication channels. It is a feature extraction technique, so it keeps the
important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance


o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: It is the number of features or variables present in the given


dataset. More easily, it is the number of columns present in the dataset.
o Correlation: It signifies that how strongly two variables are related to each
other. Such as if one changes, the other variable also gets changed. The
correlation value ranges from -1 to +1. Here, -1 occurs if variables are
inversely proportional to each other, and +1 indicates that variables are
directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and
hence the correlation between the pair of variables is zero.
o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an
eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of
variables is called the Covariance Matrix.

Principal Components in PCA

As described above, the transformed new features or the output of PCA are the
Principal Components. The number of these PCs are either equal to or less than the
original features present in the dataset. Some properties of these principal
components are given below:

o The principal component must be the linear combination of the original


features.
o These components are orthogonal, i.e., the correlation between a pair of
variables is zero.
o The importance of each component decreases when going from 1 to n; the
1st PC has the most importance, and the nth PC has the least importance.
Steps for PCA algorithm


1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X
and Y, where X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent
the two-dimensional matrix of independent variable X. Here each row
corresponds to the data items, and the column corresponds to the Features.
The number of columns is the dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. In a particular column, the
features with high variance are considered more important than the features
with lower variance. If the importance of features should be independent of
the variance of the feature, then we divide each data item in a column by the
standard deviation of the column. We will name the resulting matrix Z.

4. Calculating the Covariance of Z


To calculate the covariance of Z, we take the matrix Z and transpose it; after
the transpose, we multiply it by Z. The output matrix, ZᵀZ, will be the
covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors


Now we need to calculate the eigenvalues and eigenvectors for the resultant
covariance matrix of Z. The eigenvectors of the covariance matrix are the
directions of the axes with the most information, and the corresponding
eigenvalues give the amount of variance carried along those directions.

Geometrically speaking, principal components represent the directions of the data


that explain a maximal amount of variance, that is to say, the lines that capture
most information of the data. The relationship between variance and information
here, is that, the larger the variance carried by a line, the larger the dispersion of
the data points along it, and the larger the dispersion along a line, the more the
information it has. To put all this simply, just think of principal components as new
axes that provide the best angle from which to view and evaluate the data, so that
the differences between the observations are more visible.

As there are as many principal components as there are variables in the data,
principal components are constructed in such a manner that the first principal
component accounts for the largest possible variance in the data set. For example,
let’s assume that the scatter plot of our data set is as shown below. Can we guess
the first principal component? Yes, it’s approximately the line that matches the
purple marks, because it goes through the origin and it’s the line along which the
projection of the points (red dots) is most spread out. Mathematically speaking,
it’s the line that maximizes the variance (the average of the squared distances
from the projected points to the origin).

The second principal component is calculated in the same way, with the condition
that it is uncorrelated with (i.e., perpendicular to) the first principal component and
that it accounts for the next highest variance.

This continues until a total of p principal components have been calculated, equal
to the original number of variables.

Now that we understand what we mean by principal components, let’s go back to


eigenvectors and eigenvalues. What you first need to know about them is that they
always come in pairs, so that every eigenvector has an eigenvalue. And their
number is equal to the number of dimensions of the data. For example, for a
3-dimensional data set, there are 3 variables, and therefore 3 eigenvectors with 3
corresponding eigenvalues.

Without further ado, it is eigenvectors and eigenvalues that are behind all the
magic explained above, because the eigenvectors of the covariance matrix are
actually the directions of the axes with the most variance (most information),
and these are what we call principal components. Eigenvalues are simply the
coefficients attached to the eigenvectors, giving the amount of variance carried
by each principal component. By ranking your eigenvectors in order of their
eigenvalues, highest to lowest, you get the principal components in order of
significance.

6. Sorting the Eigen Vectors

In this step, we take all the eigenvalues and sort them in decreasing order,
from largest to smallest, and simultaneously sort the eigenvectors accordingly
into a matrix P. The resultant matrix is named P*.
7. Calculating the new features or Principal Components
Here we will calculate the new features. To do this, we multiply the P*
matrix by Z. In the resultant matrix Z*, each observation is a linear
combination of the original features, and each column of Z* is independent
of the others.
8. Removing less important features from the new dataset
The new feature set is ready, so we decide what to keep and what to remove:
only the relevant or important features are kept in the new dataset, and
unimportant features are removed.
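Steps 2–7 above can be sketched with NumPy; the small 2-D dataset below is made up for illustration:

```python
import numpy as np

# Steps 2-3: data matrix (rows = items, columns = features), centered and scaled.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z (proportional to Z^T Z).
cov = np.cov(Z, rowvar=False)

# Steps 5-6: eigenvalues/eigenvectors, sorted from largest to smallest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Step 7: project the data onto the principal components.
Z_star = Z @ P_star

# Fraction of variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(explained)  # the first component carries most of the variance
```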
Applications of Principal Component Analysis

o PCA is mainly used as the dimensionality reduction technique in various AI


applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc

What Are Recommendation Systems in Machine Learning?

Recommender systems are systems designed to recommend things to the
user based on many different factors. These systems predict the products that
users are most likely to purchase or be interested in. Companies like Netflix,
Amazon, etc. use recommender systems to help their users identify the right
products or movies for them.
The recommender system deals with a large volume of information by
filtering out the most important information based on the data provided by a user and
other factors that account for the user’s preferences and interests. It finds the
match between user and item and computes the similarities between users and items
for recommendation.
Both users and service providers have benefited from these kinds of systems,
and the quality of decision-making has also improved through them.
Why the Recommendation system?
 Benefits users in finding items of their interest.
 Help item providers in delivering their items to the right user.
 Identify products that are most relevant to users.
 Personalized content.
 Help websites to improve user engagement.
What can be Recommended?


There are many different things that can be recommended by the system like movies,
books, news, articles, jobs, advertisements, etc. Netflix uses a recommender system
to recommend movies & web-series to its users. Similarly, YouTube recommends
different videos. There are many examples of recommender systems that are widely
used today.
How is User and Item Matching Done?
In order to understand how an item is recommended and how the matching is done,
let us take a look at the images below:

Showing user-item matching for social websites


Perfect matching may not be recommended
Real-life user interaction with a recommendation system


The images above show that no recommendation made to a user is perfect. In the
image above, a user has searched for a laptop with a 1TB HDD, 8GB RAM, and an
i5 processor for ₹40,000. The system has recommended the 3 most similar laptops
to the user.

Types of Recommendation System

1. Popularity-Based Recommendation System


It is a type of recommendation system that works on the principle of popularity,
or of anything which is in trend. These systems check which products or movies
are trending or most popular among users and directly recommend those.
For example, if a product is often purchased by most people, the system learns that
the product is popular, so for every new user who has just signed up, the system
will recommend that product as well, and the chance is high that the new user will
also purchase it.
Merits of popularity-based recommendation systems
 It does not suffer from the cold start problem, which means it can recommend
products on various different filters even on day one of the business.
 There is no need for the user's historical data.
Demerits of popularity-based recommendation systems
 Not personalized
 The system would recommend the same sort of products/movies, based solely
on popularity, to every user.
Example
 Google News: News filtered by trending and most popular news.
 YouTube: Trending videos.

2. Classification Model
The model uses features of both products and users to predict whether a
user will like a product or not.

Classification model

The output is either 0 or 1: 1 if the user likes the product, and 0 otherwise.
Limitations of Classification Model

It is a rigorous task to collect a high volume of information about different users and
also about products.

 Even if the collection is done, it can still be difficult to classify.

 Flexibility issues.

3. Content-Based Recommendation System


It is another type of recommendation system, which works on the principle of
similar content. If a user is watching a movie, the system will look for other
movies with similar content or of the same genre as the movie the user is watching.
Various fundamental attributes are used to compute the similarity when checking
for similar content.
To explain more about how exactly the system works, an example is stated below:

Figure 1: Different models of OnePlus phones.


Figure 1 shows different models of the OnePlus phone. If a person is looking at
a OnePlus 7, then the OnePlus 7T and OnePlus 7 Pro are recommended to the user.

But how is it recommended?


To check the similarity between the products (mobile phones in this example),
the system computes distances between them. The OnePlus 7 and OnePlus 7T both have
8GB RAM and a 48MP primary camera.
To check the similarity between two products, the Euclidean distance is
calculated; here, the distance is calculated based on RAM and camera:

Euclidean distance (7T, 7)

Euclidean distance (7Pro, 7)

The Euclidean distance between (7T, 7) is 0, whereas the Euclidean distance between
(7Pro, 7) is 4, which means the OnePlus 7 and OnePlus 7T are similar products,
whereas the OnePlus 7 Pro and OnePlus 7 are not.
In order to explain the concept, only the basic features (camera and RAM)
were used in this example, but there is no such restriction: we can compute the
distance for any of the features of the product. The basic principle remains the
same: if the distance between two items is 0, they are likely to have similar content.
There are different scenarios where we need to check similarity,
so different metrics are used. For computing the similarity between
numeric data, the Euclidean distance is used; for textual data, cosine similarity is
calculated; and for categorical data, the Jaccard similarity is computed.
Euclidean Distance: The distance between two points p and q can be calculated by the
equation

d(p, q) = √( Σi (pi − qi)² )

Cosine Similarity: The cosine of the angle between the two item vectors A and B is
calculated to compute similarity. The closer the vectors, the smaller the angle
and the larger the cosine:

cos(θ) = (A · B) / (‖A‖ ‖B‖)
Jaccard Similarity: The number of users who have rated both items A and B, divided
by the number of users who have rated either A or B, gives the similarity. It is
used for comparing categorical sets:

J(A, B) = |A ∩ B| / |A ∪ B|
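The two latter metrics can be sketched as plain functions; the example vectors and user sets below are made up:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| * |B|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def jaccard_similarity(a, b):
    # |A intersection B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Vectors pointing the same way -> angle 0, cosine close to 1.0
sim_cos = cosine_similarity([1, 2, 3], [2, 4, 6])
# Users {u2, u3} rated both items, out of 4 users who rated either -> 2/4
sim_jac = jaccard_similarity({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
print(sim_cos, sim_jac)
```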

Merits

 There is no requirement for much of the user’s data.

 We just need item data that enables us to start giving recommendations to users.
 A content-based recommender engine does not depend on other users’ data, so
even if a new user comes in, we can make recommendations as long as we have
enough of that user’s interactions to build a profile.
 It does not suffer from a cold start for new items.

Demerits

 Item data should be available in good volume.


 Features should be available to compute the similarity.

4. Collaborative Filtering
Collaborative filtering is considered one of the smartest recommender
approaches. It works on the similarity between different users, and also between
items, and is widely used on e-commerce websites and online movie websites. It
looks at the tastes of similar users and makes recommendations accordingly.
The similarity is not restricted to the tastes of users; similarity between
different items can also be considered. The system gives more efficient
recommendations when we have a large volume of information about users and
items.

Figure 2: Concept of collaborative filtering.



Figure 2 shows two different users and their interests, along with the
similarity between their tastes. Jill and Megan are found to have similar tastes,
so Jill's interests are recommended to Megan and vice versa.
This is the way collaborative filtering works. There are mainly two
approaches used in collaborative filtering, stated below:
a) User-based nearest-neighbor collaborative filtering

Figure 3: User-User Collaborative filtering


Figure 3 shows user-user collaborative filtering, with three users A,
B and C and their interest in fruit. The system finds the users who
have the same sort of taste in purchasing products; the similarity between users
is computed based upon their purchase behavior. User A and User C are similar
because they have purchased similar products.
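As a sketch of the idea, user-user similarity can be computed from binary purchase data; the user names and products below are made up:

```python
# Made-up purchase histories for three users.
purchases = {
    "A": {"apple", "banana", "mango"},
    "B": {"grapes", "orange"},
    "C": {"apple", "banana"},
}

def jaccard(u, v):
    # Similarity of two purchase sets.
    return len(u & v) / len(u | v)

def most_similar_user(target, data):
    # Pick the other user whose purchases overlap most with the target's.
    scores = {u: jaccard(data[target], items)
              for u, items in data.items() if u != target}
    return max(scores, key=scores.get)

def recommend(target, data):
    # Recommend what the nearest neighbour bought that the target has not.
    neighbour = most_similar_user(target, data)
    return data[neighbour] - data[target]

print(recommend("C", purchases))  # A is C's nearest neighbour
```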
b) Item-based nearest-neighbor collaborative filtering

Figure 4: Item-Item Collaborative filtering.


Figure 4 shows users X, Y, and Z. The system checks for items
that are similar to the items the user bought. The similarity between different items
is computed based on the items, not the users, for the prediction. Users X and Y
both purchased items A and B, so they are found to have similar tastes.

Limitations

 Enough users are required to find a match. To overcome this cold-start problem,
hybrid approaches between collaborative filtering and content-based matching
are often used.
 Even with many users and many items to recommend, the user-rating matrix is
often sparse, and it becomes challenging to find users who have rated the
same items.
 This sparsity also makes it a problem to recommend items to users.

c) Singular value decomposition and matrix factorization

Singular value decomposition, also known as the SVD algorithm, is used as a
collaborative filtering method in recommendation systems. SVD is a matrix
factorization method used to reduce the features in the data by reducing
the dimensions from N to K, where K < N.

For the recommendation part, the only piece taken care of is the
matrix factorization, which is performed on the user-item rating matrix.
Matrix factorization is all about finding two matrices whose product is the original
matrix. Item i and user u are represented by vectors qi and pu such that their dot
product is the expected rating:

expected rating: r̂ui = qi · pu


qi and pu are calculated in such a way that the squared error between the dot
product of the user and item vectors and the original ratings in the user-item
matrix is least:

min Σ(u,i) (rui − qi · pu)²   (without the regularization factor)

Regularization: Avoiding overfitting is an important aspect of any machine
learning model, because overfitting results in low accuracy on unseen data.
Regularization reduces the risk of the model being overfitted.
For this purpose, a penalty term is introduced into the above
minimization equation. λ is the regularization factor, which multiplies the
squared sum of the magnitudes of the user and item vectors:

min Σ(u,i) (rui − qi · pu)² + λ(‖qi‖² + ‖pu‖²)



To understand the importance of the factor introduced above, let's consider a case
where a user has given a very low rating to one movie and has not rated any other
movie. Without the penalty term, the algorithm would reduce the error by giving
qi a large value, which would result in low predicted ratings for all movies.
This is intuitively wrong. Because assigning large values to the vectors now adds
their magnitudes to the equation being minimized, such values are penalized and
the situation does not arise.
Bias terms: Algorithms make use of features of the data to minimize the error
between the actual value and the predicted value. Specifically, for each user u
and item i we can pull out three parameters: µ (the average rating over all items),
bu (how much user u's ratings deviate from the expected rating), and bi (the
deviation of item i's rating from µ). The bias term is

bui = µ + bu + bi

and the minimized equation becomes

min Σ(u,i) (rui − µ − bu − bi − qi · pu)² + λ(‖qi‖² + ‖pu‖² + bu² + bi²)

Minimizing with Stochastic Gradient Descent (SGD): SGD is used to minimize the
above equation. SGD works by initializing the parameters of the equation we
are trying to minimize and then iterating, reducing the error between the actual
value and the predicted value by applying a small correction each time.

SGD uses the learning rate to control how far the parameters move from their
previous values to their new values after each iteration.
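The whole loop — biased prediction, regularized squared error, and SGD updates — can be sketched in plain Python. The tiny rating list and the hyperparameters (factor count k, learning rate, λ, epoch count) are invented for illustration:

```python
import random

random.seed(0)
# (user, item, rating) observations; the data are made up.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
n_users, n_items, k = 2, 3, 2
lr, lam, epochs = 0.05, 0.02, 200

mu = sum(r for _, _, r in ratings) / len(ratings)   # global average rating
bu = [0.0] * n_users                                # user biases
bi = [0.0] * n_items                                # item biases
p = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    # r_hat = mu + bu + bi + qi . pu
    return mu + bu[u] + bi[i] + sum(p[u][f] * q[i][f] for f in range(k))

for _ in range(epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)               # prediction error
        bu[u] += lr * (err - lam * bu[u])     # update user bias
        bi[i] += lr * (err - lam * bi[i])     # update item bias
        for f in range(k):                    # update latent factors
            puf, qif = p[u][f], q[i][f]
            p[u][f] += lr * (err * qif - lam * puf)
            q[i][f] += lr * (err * puf - lam * qif)

print(round(predict(0, 0), 2))  # should end up close to the observed 5.0
```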

EM Algorithm in Machine Learning

The EM algorithm is a latent-variable method for finding the local maximum
likelihood parameters of a statistical model, proposed by Arthur Dempster, Nan
Laird, and Donald Rubin in 1977. The EM (Expectation-Maximization) algorithm
is one of the most commonly used methods in machine learning to obtain maximum
likelihood estimates of variables that are sometimes observable and sometimes
not. It is also applicable to unobserved data, sometimes called latent data.

In most real-life applications of machine learning, many relevant features are
available, but only a few of them are observable, and the rest are unobservable.
For observable variables, values can be predicted directly from instances. For
variables that are latent, i.e., not directly observable, the
Expectation-Maximization (EM) algorithm plays a vital role in predicting their
values, on the condition that the general form of the probability distribution
governing those latent variables is known to us.

What is an EM algorithm?

The Expectation-Maximization (EM) algorithm is defined as a combination of
various unsupervised machine learning techniques, used to determine the local
maximum likelihood estimates (MLE) or maximum a posteriori estimates (MAP) for
unobservable variables in statistical models. Further, it is a technique for
finding maximum likelihood estimates when latent variables are present. It is
also referred to as the latent variable model.

A latent variable model consists of both observable and unobservable variables,
where the observable ones can be predicted directly, while the unobserved ones
are inferred from the observed variables. These unobservable variables are known
as latent variables.

Key Points:

o It is known as the latent variable model to determine MLE and MAP


parameters for latent variables.
o It is used to predict values of parameters in instances where data is missing or
unobservable for learning, and this is done until convergence of the values
occurs.

EM Algorithm

The EM algorithm is the combination of various unsupervised ML algorithms, such


as the k-means clustering algorithm. Being an iterative approach, it consists of two
modes. In the first mode, we estimate the missing or latent variables. Hence it is
referred to as the Expectation/estimation step (E-step). Further, the other mode is
used to optimize the parameters of the models so that it can explain the data more
clearly. The second mode is known as the maximization-step or M-step.

o Expectation step (E - step): It involves the estimation (guess) of all missing


values in the dataset so that after completing this step, there should not be any
missing value.

o Maximization step (M - step): This step involves the use of estimated data
in the E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.

What is Convergence in the EM algorithm?

Convergence here carries its intuitive probabilistic meaning: if two random
variables have very little difference in their probability, they are said to have
converged. In other words, whenever the values of the given variables match each
other, it is called convergence.

Steps in EM Algorithm

o 1st Step: The very first step is to initialize the parameter values. Further, the
system is provided with incomplete observed data with the assumption that
data is obtained from a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to


estimate or guess the values of the missing or incomplete data using the
observed data. Further, E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use
complete data obtained from the 2nd step to update the parameter values.
Further, M-step primarily updates the hypothesis.
o 4th Step: The last step is to check whether the values of the latent variables
are converging. If yes, stop the process; otherwise, repeat from step 2 until
convergence occurs.
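The four steps can be sketched for a two-component one-dimensional Gaussian mixture. The data, initial guesses, and iteration count are all invented for illustration:

```python
import math
import random

random.seed(1)
# Synthetic data drawn from two Gaussians with true means 0 and 5.
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]

def pdf(x, mean, var):
    # Gaussian probability density.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Step 1: initialize the parameters (weights, means, variances).
w, means, vars_ = [0.5, 0.5], [-1.0, 6.0], [1.0, 1.0]

for _ in range(30):
    # Step 2 (E-step): responsibility of each component for each point.
    resp = []
    for x in data:
        p = [w[k] * pdf(x, means[k], vars_[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # Step 3 (M-step): re-estimate parameters from the responsibilities.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        w[k] = nk / len(data)
        means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        vars_[k] = sum(r[k] * (x - means[k]) ** 2
                       for r, x in zip(resp, data)) / nk
    # Step 4 would check convergence; here we simply run a fixed number of rounds.

print([round(m, 1) for m in sorted(means)])  # near the true means 0 and 5
```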

Gaussian Mixture Model (GMM)

The Gaussian Mixture Model (GMM) is a mixture model that represents data as a
combination of several Gaussian probability distributions with unspecified
parameters. GMM therefore requires estimating statistics such as the mean and
standard deviation, i.e., its parameters. It is used to estimate the parameters
of the probability distributions that best fit the density of a given training
dataset. Although plenty of techniques are available to estimate the parameters
of a GMM, Maximum Likelihood Estimation is one of the most popular among them.

Let's consider a case where we have a dataset with multiple data points generated
by two different processes. However, both processes involve a similar Gaussian
probability distribution, and the data are combined; hence it is very difficult
to discriminate which distribution a given point may belong to.

The processes used to generate the data points represent a latent variable, i.e.,
unobservable data. In such cases, the Expectation-Maximization algorithm is one of
the best techniques for estimating the parameters of the Gaussian distributions.
In the EM algorithm, the E-step estimates the expected value of each latent
variable, whereas the M-step optimizes the parameters using Maximum Likelihood
Estimation (MLE). This process is repeated until a good set of latent values and
a maximum-likelihood fit to the data are achieved.

Applications of EM algorithm

The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through observed data in datasets. The EM algorithm or latent variable
model has a broad range of real-life applications in machine learning. These are as
follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the value of the parameter in mixed models such as
the Gaussian Mixture Model and quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent
abilities of item response theory models.
o It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
o It is used to determine the Gaussian density of a function.

Advantages of EM algorithm

o It is very easy to implement the two basic steps of the EM algorithm in
various machine learning problems: the E-step and the M-step.
o It is mostly guaranteed that likelihood will enhance after each iteration.

o It often generates a solution for the M-step in the closed form.

Disadvantages of EM algorithm

o The convergence of the EM algorithm is very slow.


o It converges only to a local optimum.
o It takes both forward and backward probability into consideration. It is
opposite to that of numerical optimization, which takes only forward
probabilities.

Conclusion

In real-world applications of machine learning, the expectation-maximization (EM)


algorithm plays a significant role in determining the local maximum likelihood
estimates (MLE) or maximum a posteriori estimates (MAP) for unobservable
variables in statistical models. It is often used for the latent variables, i.e., to estimate
the latent variables through observed data in datasets. It is generally completed in
two important steps, i.e., the expectation step (E-step) and the Maximization step
(M-Step), where E-step is used to estimate the missing data in datasets, and M-step
is used to update the parameters after the complete data is generated in E-step.
Further, the importance of the EM algorithm can be seen in various applications such
as data clustering, natural language processing (NLP), computer vision, image
reconstruction, structural engineering, etc.

Reinforcement Learning

o Reinforcement Learning is a feedback-based Machine learning technique in


which an agent learns to behave in an environment by performing the actions
and seeing the results of actions. For each good action, the agent gets positive
feedback, and for each bad action, the agent gets negative feedback or penalty.

o In Reinforcement Learning, the agent learns automatically using feedback
without any labeled data, unlike supervised learning.
o Since there is no labeled data, the agent is bound to learn from its experience
alone.
o RL solves a specific type of problem where decision making is sequential, and
the goal is long-term, such as game-playing, robotics, etc.
o The agent interacts with the environment and explores it by itself. The primary
goal of an agent in reinforcement learning is to improve the performance by
getting the maximum positive rewards.
o The agent learns through trial and error, and based on this experience,
it learns to perform the task in a better way. Hence, we can say
that "Reinforcement learning is a type of machine learning method where
an intelligent agent (computer program) interacts with the environment and
learns to act within it." How a robotic dog learns the movement of its
arms is an example of reinforcement learning.

o It is a core part of Artificial Intelligence, and all AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program the agent,
as it learns from its own experience without any human intervention.
o Example: Suppose there is an AI agent present within a maze environment,
and his goal is to find the diamond. The agent interacts with the environment

by performing some actions, and based on those actions, the state of the agent
gets changed, and it also receives a reward or penalty as feedback.
o The agent continues doing these three things (take action, change
state/remain in the same state, and get feedback), and by doing these
actions, he learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and
which actions lead to negative feedback or penalties. For a positive reward,
the agent gets a positive point, and as a penalty, it gets a negative point.

Terms used in Reinforcement Learning

o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which an agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is random
in nature.
o Action: Actions are the moves taken by an agent within the environment.
o State: A state is the situation returned by the environment after each action
taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate
the agent's action.
o Policy: A policy is the strategy applied by the agent to choose the next action
based on the current state.
o Value: The expected long-term return with the discount factor, as
opposed to the short-term reward.
o Q-value: Mostly similar to the value, but it takes one additional
parameter, the current action (a).

Key Features of Reinforcement Learning

o In RL, the agent is not instructed about the environment and what actions need
to be taken.

o It is based on the trial and error process.


o The agent takes the next action and changes states according to the feedback
of the previous action.
o The agent may get a delayed reward.
o The environment is stochastic, and the agent needs to explore it to get
the maximum positive rewards.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement learning in ML, which are:

1. Value-based:
The value-based approach is about finding the optimal value function, which
is the maximum value at a state under any policy. Therefore, the agent expects
the long-term return at any state s under policy π.

2. Policy-based:
The policy-based approach finds the optimal policy for the maximum future
rewards without using the value function. In this approach, the agent tries
to apply a policy such that the action performed at each step helps to
maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any
state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
particular solution or algorithm for this approach because the model
representation is different for each environment.

Elements of Reinforcement Learning

There are four main elements of Reinforcement Learning, which are given below:

1. Policy
2. Reward Signal
3. Value Function

4. Model of the environment

1) Policy: A policy can be defined as a way how an agent behaves at a given time.
It maps the perceived states of the environment to the actions taken on those states.
A policy is the core element of the RL as it alone can define the behavior of the
agent. In some cases, it may be a simple function or a lookup table, whereas, for

other cases, it may involve general computation as a search process. It could be


deterministic or a stochastic policy:

For a deterministic policy: a = π(s)

For a stochastic policy: π(a | s) = P[At = a | St = s]
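As an illustrative sketch (the states and actions are made up), the two policy types differ only in whether π returns a single action or a distribution over actions:

```python
import random

# Deterministic policy: a = pi(s), a fixed action per state.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: pi(a | s), a probability distribution over actions per state.
stochastic_policy = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state, rng=random):
    # Sample an action according to the state's action distribution.
    actions, probs = zip(*stochastic_policy[state].items())
    return rng.choices(actions, weights=probs)[0]

print(act_deterministic("s0"))  # always 'left'
print(act_stochastic("s1"))     # 'right' with probability 0.8
```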

2) Reward Signal: The goal of reinforcement learning is defined by the reward


signal. At each state, the environment sends an immediate signal to the learning
agent, and this signal is known as a reward signal. These rewards are given
according to the good and bad actions taken by the agent. The agent's main objective
is to maximize the total number of rewards for good actions. The reward signal can
change the policy, such as if an action selected by the agent leads to low reward,
then the policy may change to select other actions in the future.

3) Value Function: The value function gives information about how good the
situation and action are and how much reward an agent can expect. A reward
indicates the immediate signal for each good and bad action, whereas a value
function specifies the good state and action for the future. The value function
depends on the reward as, without reward, there could be no value. The goal of
estimating values is to achieve more rewards.

4) Model: The last element of reinforcement learning is the model, which mimics
the behavior of the environment. With the help of the model, one can make
inferences about how the environment will behave. Such as, if a state and an action
are given, then a model can predict the next state and reward.

The model is used for planning, which means it provides a way to take a course of
action by considering all future situations before actually experiencing those
situations. The approaches for solving the RL problems with the help of the

model are termed model-based approaches. Comparatively, an approach that does not
use a model is called a model-free approach.

RL — Model-based Reinforcement Learning

Reinforcement learning (RL) maximizes the rewards for our actions. Rewards depend
on the policy and on the system dynamics (the model).

In Model-free RL, we ignore the model. We depend on sampling and simulation to
estimate rewards, so we don't need to know the inner workings of the system. In
Model-based RL, if we can define a cost function ourselves, we can calculate the
optimal actions using the model directly.
RL can be roughly divided into Model-free and Model-based methods. In this
article, we will discuss how to establish a model and use it to make the best
decisions.
Terms
Control theory has a strong influence on Model-based RL. Therefore, let’s go
through some of the terms first.
In reinforcement learning, we find an optimal policy to decide actions. In
control theory, we optimize a controller.

Control is just another term for action in RL. An action is often written
as a or u with states as s or x. A controller uses a model (the system dynamics) to
decide the controls in an optimal trajectory which is expressed as a sequence of
states and controls.

In model-based RL, we optimize the trajectory for the least cost instead of the
maximum rewards.

Model-free RL v.s. Model-based RL

As mentioned before, Model-free RL ignores the model and cares less about the
inner workings. We fall back on sampling to estimate rewards.

We use Policy Gradients, Value Learning or other Model-free RL to find a policy


that maximizes rewards.

On the contrary, Model-based RL focuses on the model.

With a cost function, we find an optimal trajectory with the lowest cost.

Known models

In many games, like Go, the rules of the game are the model.

AlphaGo

In other cases, it can be the laws of physics. Sometimes, we know how to model
the system and can build simulators for it.

Source: Vehicle dynamics model & Kinematic Bicycle Model

Mathematically, the model predicts the next state.

We can define this model with rules or equations. Or, we can model it, like using
the Gaussian Process, Gaussian Mixture Model (GMM) or deep networks. To fit
these models, we run a controller to collect sample trajectories and train the models
with supervised learning.
Motivation

Model-based RL has a strong advantage of being sample efficient. Many models


behave linearly at least in the local proximity. This requires very few samples to learn
them. Once the model and the cost function are known, we can plan the optimal
controls without further sampling. As shown below, On-policy Gradient methods
can take 10M training iterations while Model-based RL is in the range of hundreds.
To train a physical robot for a simple task, a Model-based method may take about 20
minutes while a Policy Gradient method may take weeks. However, this advantage
diminishes when physical simulations can be replaced by computer simulations.

Since the trajectory optimization in Model-based methods is far more complex,


Model-free RL will be more favorable if computer simulations are accurate enough.
Also, to simplify the computation, Model-based methods make more assumptions and
approximations, and therefore the trained models are limited to fewer tasks.

Learn the model

In Model-based RL, the model may be known or learned. In the latter case, we run
a base policy, like a random or any educated policy, observe the trajectories,
and then fit a model to this sampled data.

In step 2 above, we use supervised learning to train a model that minimizes the
least-squares error on the sampled trajectory. In step 3, we use a trajectory
optimization method, like iLQR, to calculate the optimal trajectory using the model
and a cost function that, say, measures how far we are from the target location
and the amount of effort spent.
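Step 2 can be sketched for a toy one-dimensional system s' = a·s + b·u: run a random base policy, record transitions, and fit a and b by least squares. The "true" dynamics and noise level are invented for illustration:

```python
import random

random.seed(0)
true_a, true_b = 0.9, 0.5  # hidden dynamics we pretend not to know

# Step 1: run a random base policy and record (s, u, s') transitions.
transitions = []
s = 0.0
for _ in range(200):
    u = random.uniform(-1, 1)
    s_next = true_a * s + true_b * u + random.gauss(0, 0.01)  # noisy dynamics
    transitions.append((s, u, s_next))
    s = s_next

# Step 2: least-squares fit of a and b (2x2 normal equations solved by hand).
Sss = sum(s * s for s, u, sn in transitions)
Suu = sum(u * u for s, u, sn in transitions)
Ssu = sum(s * u for s, u, sn in transitions)
Ssy = sum(s * sn for s, u, sn in transitions)
Suy = sum(u * sn for s, u, sn in transitions)
det = Sss * Suu - Ssu * Ssu
a_hat = (Ssy * Suu - Suy * Ssu) / det
b_hat = (Suy * Sss - Ssy * Ssu) / det

print(round(a_hat, 2), round(b_hat, 2))  # recovers roughly 0.9 and 0.5
```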
Learn the model iteratively

However, it is vulnerable to drifting. Tiny errors accumulate fast along the


trajectory. The search space is too big for any base policy to cover fully. We may

land in areas where the model has not been learned yet. Without a proper model
around these areas, we cannot plan the optimal controls.

To address that, instead of learning the model at the beginning, we continue to


sample and fit the model as we move along the path.

So we repeat steps 2 through 4, continuing to collect samples and fit the model
around the searched space.

MPC (Model Predictive Control)

Nevertheless, the previous method executes all planned actions before fitting
the model again. We may already have drifted too far off course.

In MPC, we optimize the whole trajectory, but we take the first action only. We
then observe and replan again. The replanning gives us a chance to take corrective
action after observing the current state anew. For a stochastic model, this is
particularly helpful.

By constantly changing plans, we are less vulnerable to problems in the model.


Hence, MPC allows us to use models that are far less accurate.
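The receding-horizon loop can be sketched on a toy one-dimensional system. The dynamics, the quadratic cost, and the crude discrete-action planner standing in for a real trajectory optimizer (such as iLQR) are all assumptions for illustration:

```python
def dynamics(s, u):
    # Known (assumed) model of the system.
    return 0.9 * s + 0.5 * u

def plan(s, horizon=5, candidates=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Greedy planner over a discrete action set; a toy stand-in for a
    real trajectory optimizer. Cost is the squared distance from s = 0."""
    seq = []
    for _ in range(horizon):
        u = min(candidates, key=lambda c: dynamics(s, c) ** 2)
        seq.append(u)
        s = dynamics(s, u)
    return seq

s = 4.0                        # start away from the target state s = 0
for _ in range(20):
    u = plan(s)[0]             # optimize the whole trajectory, keep action 1 only
    s = dynamics(s, u)         # observe the new state, then replan

print(round(s, 2))             # the state is driven close to 0
```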
Backpropagate to policy

The controls produced by a controller are calculated using a model and a cost
function using the trajectory optimization methods like iLQR.

However, we can also model a policy π directly using a deep network or a Gaussian
Process. For example, we can use the model to predict the next state given an
action. Then, we use the policy to decide the next action, and use the state and
action to compute the cost. Finally, we backpropagate the cost to train the policy.

Temporal Difference (TD) learning

Temporal Difference (TD) learning is likely the most core concept in


Reinforcement Learning. Temporal Difference learning, as the name suggests,
focuses on the differences the agent experiences in time.

TD methods aim, for some policy π, to provide and update an estimate V of the
policy's value vπ for all states or state-action pairs, updating the estimate as
the agent experiences them.

1. Gamma (γ): the discount rate. A value between 0 and 1. The higher the
value the less you are discounting.

2. Lambda (λ): the credit assignment variable. A value between 0 and 1. The
higher the value the more credit you can assign to further back states and
actions.

3. Alpha (α): the learning rate. How much of the error we accept and
therefore adjust our estimates towards. A value between 0 and 1. A higher
value adjusts aggressively, accepting more of the error, while a smaller one
adjusts conservatively but makes more cautious moves towards the actual
values.

4. Delta (δ): a change or difference in value

The most basic method for TD learning is the TD(0) method. Temporal-
Difference TD(0) learning updates the estimated value of a state V for policy based
on the reward the agent received and the value of the state it transitioned to.

Specifically, if our agent is in a current state st, takes the action at and receives
the reward rt, then we update our estimate of V following
V(st) ← V(st) + α[rt+1 + γV(st+1) − V(st)],

a simple diagram of which can be seen below. The value [rt+1 + γV(st+1) − V(st)]
is commonly called the TD error and is used in various forms throughout
Reinforcement Learning. Here the TD error is the difference between the current
estimate V(st) and the reward actually gained from the transition between st and
st+1 plus the discounted value estimate of V(st+1); the error in V(st) is thus
corrected slowly over many passes. α is a constant step-size parameter that
controls how quickly the Temporal Difference algorithm learns. For the algorithms
that follow, we generally require α to be suitably small to guarantee convergence;
however, the smaller the value of α, the smaller the change made at each update,
and therefore the slower the convergence.
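A minimal sketch of the TD(0) update on a made-up two-state chain (state 0 steps to state 1 with reward 0; state 1 steps to a terminal state with reward 1):

```python
alpha, gamma = 0.1, 1.0
V = {0: 0.0, 1: 0.0, "terminal": 0.0}
# 100 identical episodes of (s, r, s') transitions for this toy chain.
episodes = [[(0, 0.0, 1), (1, 1.0, "terminal")]] * 100

for episode in episodes:
    for s, r, s_next in episode:
        # TD error: r + gamma * V(s') - V(s)
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error

print(round(V[0], 2), round(V[1], 2))  # both approach the true value 1.0
```

Note how V(0) only improves once V(1) has picked up the reward, illustrating how information trickles backwards through the chain over many passes.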

Example of the TD(0) Update

Temporal Difference learning is just trying to estimate the value function vπ(st),
as an estimate of how much the agent wants to be in a certain state, which we
repeatedly improve via the reward outcome and the current estimate
of vπ(st+1). This way, the estimate of the current state relies on the estimates
of all future states, so information slowly trickles down over many runs through
the chain.

Rather than estimating the state-value function, it is commonly more effective to
estimate the action-value pair for a particular policy, qπ(s, a), for s ∈ S and
a ∈ A, commonly referred to as Q-values. These are typically stored in an array,
each cell referring to a specific state-action Q-value.

Q-Learning is arguably the most popular Reinforcement Learning method. It is
particularly popular because its formula is simple both to follow and to
compute. The aim is to learn an estimate Q(s, a) of the optimal q∗(s, a) by having
our agent play through and experience a series of actions and states, updating our
estimates following

Q(st, at) ← Q(st, at) + α[rt+1 + γ maxa Q(st+1, a) − Q(st, at)].

Here our Q-values are updated by comparing the current Q-value to the reward
gained plus the maximal greedy option available to our agent in the next
state st+1 (a figure similar to the one for TD(0) is below), and hence we can
estimate the action-value function Q(s, a) directly. This estimate is
independent of the policy currently being followed; the policy the agent follows
only affects which states will be visited upon selecting an action in the
new state and moving thereafter. Q-learning performs updates only as a function of
the seemingly optimal actions, regardless of which action will actually be chosen.

Example of the Q-Learning update
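The update can be sketched on a toy five-state corridor where only reaching the last state pays a reward; the environment and hyperparameters are invented for illustration:

```python
import random

# Toy corridor: states 0..4, reward 1.0 only on reaching state 4.
random.seed(0)
n_states, actions = 5, ["left", "right"]
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    s_next = max(0, s - 1) if a == "left" else min(n_states - 1, s + 1)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for _ in range(300):                      # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        # Off-policy target: greedy (max) value of the next state.
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

greedy = [max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)]
print(greedy)  # the learned greedy policy heads towards the reward
```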

SARSA is an on-policy Temporal Difference control method and can be seen as a
variant of Q-Learning. By on-policy, we mean that the estimate of qπ(st, at)
depends on our current policy π, and when we make the update we assume that we
will continue with π for the remainder of the agent's current episode, whatever
states or actions that might entail. For the SARSA method, we make the update
Q(st, at) ← Q(st, at) + α[rt+1 + γQ(st+1, at+1) − Q(st, at)].

The SARSA update has one conceptual wrinkle: when updating, we imply that we know
in advance what the next action at+1 will be for any possible next state. This
requires that we step forward and compute the next action of our policy when
updating, so learning is highly dependent on the current policy the agent is
following. This complicates exploration, and it is therefore common to use some
form of ε-soft policy for on-policy methods.
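The on-policy update can be sketched on a toy five-state corridor, using an ε-greedy (hence ε-soft) policy both to act and to pick the at+1 used in the update; the environment and hyperparameters are invented:

```python
import random

# Toy corridor: states 0..4, reward 1.0 only on reaching state 4.
random.seed(0)
n_states, actions = 5, ["left", "right"]
alpha, gamma, epsilon = 0.3, 0.9, 0.2
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    s_next = max(0, s - 1) if a == "left" else min(n_states - 1, s + 1)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

def choose(s):
    # epsilon-soft policy: explore with probability epsilon, else act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda act: Q[(s, act)])

for _ in range(500):                      # episodes
    s = 0
    a = choose(s)
    while s != n_states - 1:
        s_next, r = step(s, a)
        a_next = choose(s_next)           # the action the policy will actually take
        # On-policy target uses Q(s', a'), not the greedy max over actions.
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next

greedy = [max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)]
print(greedy)
```

The only structural difference from Q-learning is the target term: Q(s', a') with a' sampled from the policy, rather than the greedy max over actions in s'.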
