lOMoARcPSD|369 802 53
In the previous topic, we learned supervised machine learning, in which models are
trained using labeled data. But there are many cases in which we do not have labeled
data and need to find the hidden patterns in a given dataset. To solve such cases in
machine learning, we need unsupervised learning techniques.
Below are some main reasons which describe the importance of Unsupervised
Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which
makes it especially important.
o In the real world, we do not always have input data with corresponding outputs,
so to solve such cases, we need unsupervised learning.
Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are not given. This unlabeled input data is fed to the
machine learning model in order to train it. The model first interprets the raw data to
find the hidden patterns and then applies a suitable algorithm, such as k-means
clustering, hierarchical clustering, etc.
Once the suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of
problems:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Clustering does this by finding similar patterns in the unlabelled dataset, such as
shape, size, color, behavior, etc., and divides the data as per the presence and
absence of those patterns. It is an unsupervised learning method, hence no
supervision is provided to the algorithm, and it deals with an unlabeled dataset.
After applying this clustering technique, each cluster or group is given a
cluster-ID. The ML system can use this ID to simplify the processing of large and
complex datasets.
Example: Let's understand the clustering technique with the real-world example of a
shopping mall: when we visit any shopping mall, we can observe that things with
similar usage are grouped together. T-shirts are grouped in one section and trousers
in another; similarly, in the fruit and vegetable sections, apples, bananas, mangoes,
etc., are grouped separately, so that we can easily find things. The clustering
technique works in the same way. Another example of clustering is grouping
documents by topic.
The clustering technique can be widely used in various tasks. Some of the most
common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
Apart from these general usages, clustering is used by Amazon in its recommendation
system to provide recommendations based on a user's past searches for
products. Netflix also uses this technique to recommend movies and web series
to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see that
the different fruits are divided into several groups with similar properties.
Types of Clustering
1. Exclusive Clustering
2. Overlapping Clustering
3. Hierarchical Clustering
Here you can see that all similar data points are clustered: all the blue-colored data
points are clustered into the blue cluster, and all the red-colored data points are
clustered into the red cluster.
In overlapping clustering, we can see that some of the blue data points and some of
the pink data points overlap.
Observe this picture. There are 6 different data points, namely A, B, C, D, E, and F.
The clustering methods are broadly divided into Hard clustering (each data point
belongs to only one group) and Soft clustering (data points can also belong to
another group). But various other approaches to clustering also exist. Below are the
main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the
number of pre-defined groups. The cluster centers are created in such a way that
the distance between the data points and their own cluster centroid is minimal
compared to the distance to other cluster centroids.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped distributions are formed as long as the dense regions can
be connected. The algorithm does this by identifying different clusters in the dataset
and connecting the areas of high density into clusters. The dense areas in data space
are separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, commonly the Gaussian distribution.
Hierarchical Clustering
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to
more than one group or cluster. Each data point has a set of membership coefficients,
which depend on its degree of membership in each cluster. The Fuzzy C-means
algorithm is an example of this type of clustering; it is sometimes also known as
the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many
types of clustering algorithms have been published, but only a few are commonly
used. The choice of clustering algorithm depends on the kind of data we are using:
some algorithms require us to guess the number of clusters in the given dataset,
whereas others require finding the minimum distance between observations in the
dataset.
Here we discuss the most popular clustering algorithms that are widely used
in machine learning:
Applications of Clustering
K-means allows us to cluster the data into different groups and is a convenient way
to discover the categories of groups in an unlabeled dataset on its own, without the
need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value
of k should be predetermined in this algorithm.
Hence each cluster has data points with some commonalities and is far away from
other clusters. The below diagram explains the working of the K-means clustering
algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They can be points other than those in
the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, which means reassign each data point to the new
closest centroid. If any reassignment occurs, go to Step-4 again; otherwise, the
model is ready.
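The steps above can be sketched directly in numpy. This is a minimal illustration on made-up toy data, not a production implementation (real use would add convergence checks and smarter initialization):

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step-3: assign each point to its closest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Step-4: move each centroid to the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# two obvious blobs as toy data
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
```

After a few iterations the two blobs end up in separate clusters, which is exactly the reassignment loop of Steps 3 to 5.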
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take the number k of clusters, i.e., K=2, to identify the dataset and put
the points into different clusters. It means here we will try to group these data
points into two different clusters.
o Now we will assign each data point of the scatter plot to its closest centroid.
We will compute this by applying the mathematics that we have studied for
calculating the distance between two points. Then we will draw a median line
between both centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer
to the K1 or blue centroid, and the points to the right of the line are closer to the
yellow centroid. Let's color them blue and yellow for clear visualization.
From the above image, we can see that one yellow point is on the left side of the line,
and two blue points are to the right of the line. So, these three points will be assigned
to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding
new centroids or K-points. We will repeat the process by finding the center of gravity
of each cluster's data points.
o As we have got the new centroids, we will again draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either
side of the line, which means our model is formed. Consider the below
image:
As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the image:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of
clusters. This method uses the concept of the WCSS value. WCSS stands for Within-
Cluster Sum of Squares, which defines the total variation within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²

Here, Σ_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squares of the distances between each
data point and its centroid within cluster 1, and the same applies to the other two terms.
To measure the distance between data points and the centroid, we can use any method
such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
Since the graph shows a sharp bend, which looks like an elbow, it is known as the
elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters up to the number of data points. If we
choose the number of clusters equal to the number of data points, the value of WCSS
becomes zero, and that will be the endpoint of the plot.
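The WCSS value itself can be computed in a few lines. A small sketch (the toy data is invented for illustration) shows how WCSS drops sharply when k matches the true grouping, which is the bend the elbow method looks for:

```python
import numpy as np

def wcss(X, labels, centroids):
    # within-cluster sum of squares: squared distance of every point
    # to the centroid of its own cluster, summed over all clusters
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))

# toy data: two tight blobs
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])

# k = 1: a single centroid for everything -> large WCSS
w1 = wcss(X, np.zeros(4, dtype=int), [X.mean(axis=0)])

# k = 2: one centroid per blob -> WCSS collapses (the "elbow")
labels = np.array([0, 0, 1, 1])
cents = [X[labels == j].mean(axis=0) for j in (0, 1)]
w2 = wcss(X, labels, cents)
```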
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look
similar, but they differ in how they work. One difference is that, unlike in the
K-means algorithm, there is no requirement to predetermine the number of clusters.
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data
points, so the number of clusters will also be N.
o Step-2: Take the two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
o Step-4: Repeat Step-3 until only one cluster is left, so we will get one
big cluster.
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
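Steps 1 to 4 can be sketched as a naive single-linkage agglomeration. This is illustrative only; real implementations (for example in scipy) use far more efficient algorithms:

```python
import numpy as np

def single_linkage_merge_order(X):
    # Step-1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Steps 2-4: repeatedly merge the two closest clusters
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: shortest distance between any two members
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(sorted(clusters[a] + clusters[b]))
        clusters[b] = clusters[a] + clusters[b]
        del clusters[a]
    return merges

# five 1-D points: two close pairs and one outlier
X = np.array([[0.0], [0.1], [5.0], [5.2], [10.0]])
merges = single_linkage_merge_order(X)
```

The merge order (closest pairs first) is exactly what the dendrogram in Step-5 records.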
As we have seen, the closest distance between two clusters is crucial for
hierarchical clustering. There are various ways to calculate the distance between two
clusters, and these ways decide the rule for clustering. These measures are
called Linkage methods. Some of the popular linkage methods are given below:
1. Single Linkage: It is the shortest distance between the closest points of the
clusters. Consider the below image:
2. Complete Linkage: It is the farthest distance between two points of two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single linkage.
From the above approaches, we can apply any of them according to the type
of problem or business requirement.
The dendrogram is a tree-like structure that is mainly used to record each step
the HC algorithm performs. In the dendrogram plot, the y-axis shows
the Euclidean distances between the data points, and the x-axis shows all the data
points of the given dataset.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part shows how clusters are created in
agglomerative clustering, and the right part shows the corresponding
dendrogram.
We can cut the dendrogram tree structure at any level as per our requirement.
Cluster Validity
For cluster analysis, the analogous question is: how do we evaluate the "goodness" of
the resulting clusters?
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.
Cohesion is measured by the within-cluster sum of squares:

WSS = Σ_i Σ_{x in Ci} (x − mi)²

Separation is measured by the between-cluster sum of squares:

BSS = Σ_i |Ci| (m − mi)²

where |Ci| is the size of cluster i, mi is the mean of cluster i, and m is the overall
mean (the mean of the means).
A proximity graph-based approach can also be used for cohesion and separation.
Cluster cohesion is the sum of the weight of all links within a cluster.
Cluster separation is the sum of the weights between nodes in the cluster
and nodes outside the cluster.
Now, let's discuss two internal cluster validity indices, namely the Dunn index and
the DB index.
Dunn index:
The Dunn index (DI) (introduced by J. C. Dunn in 1974), a metric for
evaluating clustering algorithms, is an internal evaluation scheme, where the
result is based on the clustered data itself. Like all other such indices, the aim of
the Dunn index is to identify sets of clusters that are compact, with small
variance between members of a cluster, and well separated, where the means
of different clusters are sufficiently far apart.
The higher the Dunn index value, the better the clustering. The number of
clusters that maximizes the Dunn index is taken as the optimal number of clusters k. It
also has some drawbacks: as the number of clusters and the dimensionality of the data
increase, the computational cost also increases. The Dunn index for c
clusters is defined as the minimum inter-cluster distance divided by the maximum
cluster diameter:

DI = min_{1 ≤ i < j ≤ c} δ(Ci, Cj) / max_{1 ≤ k ≤ c} Δ(Ck)

where δ(Ci, Cj) is the distance between clusters Ci and Cj, and Δ(Ck) is the diameter
of cluster Ck.
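A direct brute-force sketch of the Dunn index, assuming Euclidean distances, closest-member inter-cluster distance, and farthest-pair diameters:

```python
import numpy as np

def dunn_index(X, labels):
    ids = np.unique(labels)
    # separation: distance between the closest members of any two clusters
    min_sep = min(np.linalg.norm(a - b)
                  for i in ids for j in ids if i < j
                  for a in X[labels == i] for b in X[labels == j])
    # compactness: largest diameter (farthest pair) within any one cluster
    max_diam = max(np.linalg.norm(a - b)
                   for i in ids
                   for a in X[labels == i] for b in X[labels == i])
    return min_sep / max_diam

# toy data: two tight, well-separated 1-D clusters
X = np.array([[0.0], [0.1], [10.0], [10.1]])
di_good = dunn_index(X, np.array([0, 0, 1, 1]))  # correct grouping
di_bad = dunn_index(X, np.array([0, 1, 0, 1]))   # scrambled grouping
```

The correct grouping scores far higher, matching the "higher is better" interpretation above.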
DB index:
The Davies–Bouldin index (DBI) (introduced by David L. Davies and
Donald W. Bouldin in 1979), a metric for evaluating clustering algorithms, is an
internal evaluation scheme, where the validation of how well the clustering has
been done is made using quantities and features inherent to the dataset.
The lower the DB index value, the better the clustering. It also has a drawback:
a good value reported by this method does not imply the best information
retrieval.
The DB index for k clusters is defined as:

DB = (1/k) Σ_{i=1..k} max_{j ≠ i} (σi + σj) / d(ci, cj)

where σi is the average distance of the points in cluster i to its centroid ci, and
d(ci, cj) is the distance between centroids ci and cj.
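A minimal sketch of the DB index, assuming Euclidean distances and centroid-based σ values:

```python
import numpy as np

def davies_bouldin(X, labels):
    ids = list(np.unique(labels))
    cents = [X[labels == i].mean(axis=0) for i in ids]
    # sigma_i: average distance of cluster i's points to its centroid
    sig = [np.mean(np.linalg.norm(X[labels == i] - c, axis=1))
           for i, c in zip(ids, cents)]
    k = len(ids)
    # for each cluster take its worst (largest) similarity ratio, then average
    return np.mean([max((sig[i] + sig[j]) / np.linalg.norm(cents[i] - cents[j])
                        for j in range(k) if j != i)
                    for i in range(k)])

labels = np.array([0, 0, 1, 1])
X_tight = np.array([[0.0], [0.1], [10.0], [10.1]])  # compact, far apart
X_loose = np.array([[0.0], [4.0], [6.0], [10.0]])   # spread out, close
db_tight = davies_bouldin(X_tight, labels)
db_loose = davies_bouldin(X_loose, labels)
```

The tight, well-separated clustering gets the lower (better) score, matching the "lower is better" rule above.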
Many interesting algorithms are applied to analyze very large datasets. Most
algorithms don't provide any means for their validation and evaluation. So it is very
difficult to conclude which clusters are the best and should be taken for analysis.
1. Silhouette Index
2. Dunn Index
3. DB Index
4. CS Index
5. I- Index
6. XB or Xie Beni Index
Silhouette Index –
The silhouette plot displays a measure of how close each point in one cluster
is to points in the neighboring clusters, and thus provides a way to assess parameters
like the number of clusters visually.
The silhouette validation technique calculates the silhouette index for each sample,
the average silhouette index for each cluster, and the overall average silhouette index
for a dataset. Using this approach, each cluster can be represented by a silhouette
index, which is based on the comparison of its tightness and separation.
If the silhouette index value is high, the object is well matched to its own
cluster and poorly matched to neighbouring clusters. The silhouette coefficient is
calculated using the mean intra-cluster distance a(i) and the mean nearest-cluster
distance b(i) for each sample, and is defined as:

S(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) is the average dissimilarity of the ith object to all other objects in the same
cluster, and b(i) is the average dissimilarity of the ith object to all objects in the
closest other cluster.
Range of Silhouette Value:
S(i) will lie in the range [-1, 1]:
1. If the silhouette value is close to 1, the sample is well clustered and has been
assigned to a very appropriate cluster.
2. If the silhouette value is close to 0, the sample could be assigned to another
cluster closest to it, as the sample lies equally far away from both clusters.
This indicates overlapping clusters.
3. If the silhouette value is close to -1, the sample is misclassified and is merely
placed somewhere in between the clusters.
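The per-sample silhouette coefficient follows straight from the definitions of a(i) and b(i); a brute-force sketch:

```python
import numpy as np

def silhouette(X, labels, i):
    own = labels[i]
    # a(i): mean distance to the other points in its own cluster
    a = np.mean([np.linalg.norm(X[i] - X[j])
                 for j in range(len(X)) if labels[j] == own and j != i])
    # b(i): mean distance to the nearest other cluster
    b = min(np.mean([np.linalg.norm(X[i] - X[j])
                     for j in range(len(X)) if labels[j] == c])
            for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)

# a well-placed point scores near +1
X = np.array([[0.0], [0.2], [10.0], [10.2]])
s_good = silhouette(X, np.array([0, 0, 1, 1]), 0)

# a misassigned point (near cluster 0 but labeled 1) scores negative
X2 = np.array([[0.0], [0.2], [0.1], [10.0]])
s_bad = silhouette(X2, np.array([0, 0, 1, 1]), 2)
```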
EXTERNAL INDEX
There are different metrics used to evaluate the performance of a clustering
model or clustering quality.
Purity
Normalized mutual information (NMI)
Rand index
Purity
Purity is quite simple to calculate. We assign a label to each cluster based on the most
frequent class in it. Then the purity becomes the number of correctly matched class
and cluster labels divided by the total number of data points. Consider a case where
our clustering model groups the data points into 3 clusters as seen below:
Each cluster is assigned the most frequent class label. We sum the number of
correct class labels in each cluster and divide it by the total number of data points.
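Purity is a few lines of code. The cluster composition below is hypothetical, since the original figure is not reproduced here:

```python
from collections import Counter

def purity(clusters):
    # clusters: one list of true class labels per cluster
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    total = sum(len(c) for c in clusters)
    return correct / total

# hypothetical grouping of 18 points drawn from classes A, B, C
clusters = [["A"] * 5 + ["B"],
            ["B"] * 4 + ["C"] * 2,
            ["C"] * 4 + ["A"] + ["B"]]
p = purity(clusters)  # (5 + 4 + 4) / 18
```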
P(i) is the probability of the label i. Let's calculate the entropy of the class labels
in the previous example.
We can calculate the probability of a class label by dividing the number of data points
belonging to that class by the total number of data points. For instance, the probability
of class A is 6 / 18.
The entropy in our case is calculated as below. If you run the calculation, you will
see that the result is about 1.089.
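The entropy here is H = −Σ P(i) ln P(i) with the natural log. The full class distribution is not reproduced in the text, so the counts below (A = 6, B = 7, C = 5 over 18 points) are an assumption: A's probability of 6/18 matches the text, and this split reproduces the quoted ≈1.089:

```python
import math

# assumed class counts over 18 points (only A = 6 is stated in the text)
counts = {"A": 6, "B": 7, "C": 5}
total = sum(counts.values())
# Shannon entropy with the natural logarithm
entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
```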
Rand index
a is the number of times a pair of elements is in the same cluster for both the
actual and predicted clustering, which we calculate as 2.
b is the number of times a pair of elements is not in the same cluster for
both the actual and predicted clustering, which we calculate as 8.
The expression in the denominator is the total number of pairs (a binomial
coefficient), which is 15.
Thus, the Rand index in this case is (a + b) / 15 = 10 / 15 ≈ 0.67.
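A brute-force Rand index over all pairs. The example labelings are hypothetical, chosen to reproduce the quoted counts a = 2 and b = 8 over 6 points (15 pairs):

```python
from itertools import combinations

def rand_index(actual, predicted):
    pairs = list(combinations(range(len(actual)), 2))
    # a: pairs placed together in both clusterings
    a = sum(actual[i] == actual[j] and predicted[i] == predicted[j]
            for i, j in pairs)
    # b: pairs placed apart in both clusterings
    b = sum(actual[i] != actual[j] and predicted[i] != predicted[j]
            for i, j in pairs)
    return (a + b) / len(pairs)

# hypothetical labelings giving a = 2, b = 8, 15 pairs in total
actual = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
ri = rand_index(actual, predicted)
```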
A dataset may contain a huge number of input features, which makes
the predictive modeling task more complicated. Because it is very difficult to
visualize or make predictions for a training dataset with a high number of features,
dimensionality reduction techniques are required for such cases.
Dimensionality reduction is commonly used in fields that deal with high-dimensional
data, such as speech recognition, signal processing, bioinformatics, etc. It can also be
used for data visualization, noise reduction, cluster analysis, etc.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
o By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
o Less computation and training time are required for reduced feature
dimensions.
o Reduced feature dimensions help in visualizing the data
quickly.
o It removes redundant features (if present) by taking care of
multicollinearity.
There are also some disadvantages of applying dimensionality reduction, which
are given below:
There are two ways to apply the dimension reduction technique, which are given
below:
Feature Selection
Feature selection is the process of selecting a subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high
accuracy. In other words, it is a way of selecting the optimal features from the input
dataset.
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine
learning model for its evaluation. In this method, some features are fed to the ML
model and its performance is evaluated. The performance decides whether to add or
remove those features to increase the accuracy of the model. This method is more
accurate than the filter method but more complex to use. Some common techniques
of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
o LASSO
o Elastic Net
o Ridge Regression, etc.
Feature Extraction:
a. Principal Component Analysis (PCA)
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
PCA works by considering the variance of each attribute, because an attribute with
high variance shows a good split between the classes, and hence PCA reduces the
dimensionality. Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing the power allocation in various
communication channels.
The backward feature elimination technique is mainly used while developing a Linear
Regression or Logistic Regression model. The below steps are performed in this
technique to reduce the dimensionality or in feature selection:
o In this technique, firstly, all n variables of the given dataset are taken to
train the model.
o The performance of the model is checked.
o Now we remove one feature at a time and train the model on n-1 features,
n times, computing the performance of the model each time.
o We identify the variable whose removal has made the smallest or no change in
the performance of the model, and then we drop that variable;
after that, we are left with n-1 features.
By repeating this process, and by selecting the best model performance and the
maximum tolerable error rate, we can define the optimal number of features required
for the machine learning algorithm.
Forward feature selection follows the inverse of the backward elimination
process. In this technique, we don't eliminate features; instead, we
find the best features that produce the highest increase in the performance of the
model. Below steps are performed in this technique:
o We start with a single feature only, and progressively add one feature
at a time.
o Here we train the model on each feature separately.
Missing Value Ratio
If a dataset has too many missing values, we drop those variables, as they do not
carry much useful information. To perform this, we can set a threshold level, and if
a variable has more missing values than that threshold, we drop that variable.
The higher the threshold value, the more efficient the reduction.
Low Variance Filter
As with the missing value ratio technique, data columns with few changes in the
data carry less information. Therefore, we calculate the variance of each
variable, and all data columns with variance lower than a given threshold are
dropped, because low-variance features will not affect the target variable.
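The low variance filter is a few lines of numpy; the threshold of 0.01 is an arbitrary choice for the toy data:

```python
import numpy as np

def drop_low_variance(X, threshold=0.01):
    # keep only columns whose variance exceeds the threshold
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

# the middle column is constant, so it carries no information
X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, 1.0],
              [3.0, 5.0, 0.0]])
X_reduced, keep = drop_low_variance(X)
```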
High Correlation Filter
High correlation refers to the case when two variables carry approximately the same
information, which can degrade the performance of the model. The correlation
between independent numerical variables is measured by the calculated correlation
coefficient. If this value is higher than a threshold value, we can remove one of the
variables from the dataset. We should keep those variables or features that show a
high correlation with the target variable.
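A sketch of a high-correlation filter between input features. The 0.9 threshold is a common but arbitrary choice, and keeping the earlier column of each correlated pair is a design decision, not a rule:

```python
import numpy as np

def drop_high_correlation(X, threshold=0.9):
    corr = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix
    n = corr.shape[0]
    drop = set()
    for i in range(n):
        if i in drop:
            continue
        for j in range(i + 1, n):
            # of each highly correlated pair, keep the earlier column
            if j not in drop and abs(corr[i, j]) > threshold:
                drop.add(j)
    keep = [c for c in range(n) if c not in drop]
    return X[:, keep], keep

# column 1 is exactly 2 * column 0 (correlation 1.0); column 2 is unrelated
X = np.array([[1.0, 2.0, 4.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 3.0],
              [4.0, 8.0, 2.0]])
X_reduced, keep = drop_high_correlation(X)
```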
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine
learning. The algorithm has built-in feature importance, so we do not need to
program it separately. In this technique, we generate a large set of trees against the
target variable, and with the help of the usage statistics of each attribute, we find the
subset of most important features.
The random forest algorithm takes only numerical variables, so we need to convert
the input data into numeric data using one-hot encoding.
Factor Analysis
Factor analysis is a technique in which variables are grouped according to their
correlations with other variables, meaning variables within a group can have a
high correlation among themselves but a low correlation with variables of
other groups.
Auto-encoders
o Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
o Decoder: The function of the decoder is to recreate the output from the latent-
space representation.
PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data. It is a feature extraction technique, so it keeps the important
variables and drops the least important ones. As noted earlier, it works by
considering the variance of each attribute, and its real-world applications include
image processing, movie recommendation systems, and optimizing the power
allocation in various communication channels.
o Correlation: It signifies how strongly two variables are related to each
other, such that if one changes, the other variable also changes. The
correlation value ranges from -1 to +1. Here, -1 occurs if the variables are
inversely proportional to each other, and +1 indicates that the variables are
directly proportional to each other.
o Orthogonal: It means that variables are not correlated with each other, and
hence the correlation between a pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is
an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between pairs of
variables is called the covariance matrix.
As described above, the transformed new features, or the output of PCA, are the
principal components. The number of these PCs is either equal to or less than the
number of original features present in the dataset. Some properties of these principal
components are given below:
As there are as many principal components as there are variables in the data,
principal components are constructed in such a manner that the first principal
component accounts for the largest possible variance in the data set. For example,
let's assume that the scatter plot of our data set is as shown below; can we guess
the first principal component? Yes, it's approximately the line that matches the
purple marks, because it goes through the origin and it's the line in which the
projection of the points (red dots) is the most spread out. Or, mathematically
speaking, it's the line that maximizes the variance (the average of the squared
distances from the projected points (red dots) to the origin).
The second principal component is calculated in the same way, with the condition
that it is uncorrelated with (i.e., perpendicular to) the first principal component and
that it accounts for the next highest variance.
This continues until a total of p principal components have been calculated, equal
to the original number of variables.
For a 3-dimensional data set, there are 3 variables; therefore, there are 3 eigenvectors
with 3 corresponding eigenvalues.
Without further ado, it is eigenvectors and eigenvalues that are behind all the
magic explained above, because the eigenvectors of the covariance matrix are
actually the directions of the axes where there is the most variance (most
information), and these are what we call the principal components. Eigenvalues are
simply the coefficients attached to the eigenvectors, which give the amount of
variance carried by each principal component. By ranking the eigenvectors in order
of their eigenvalues, highest to lowest, you get the principal components in order of
significance.
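The eigenvector-and-eigenvalue recipe above can be sketched in numpy. For toy data lying exactly on a line, all the variance sits in the first component and the second eigenvalue is essentially zero:

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                # center the data
    cov = np.cov(Xc, rowvar=False)         # covariance matrix
    vals, vecs = np.linalg.eigh(cov)       # eigen-decomposition (symmetric)
    order = np.argsort(vals)[::-1]         # rank eigenvalues high -> low
    comps = vecs[:, order[:n_components]]  # top eigenvectors = the PCs
    return Xc @ comps, vals[order]

# toy data lying exactly on the line y = 2x
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
scores, eigvals = pca(X, 1)
```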
Recommender systems are systems designed to recommend things to the
user based on many different factors. These systems predict the products that users
are most likely to purchase or be interested in. Companies like Netflix, Amazon, etc.
use recommender systems to help their users identify the right products or movies
for them.
The recommender system deals with a large volume of information by
filtering out the most important information based on the data provided by a user and
other factors that reflect the user's preferences and interests. It finds the
match between user and item and imputes the similarities between users and items
for recommendation.
Both users and service providers have benefited from these kinds of systems, and
the quality of the decision-making process has also improved through them.
Why the Recommendation system?
Benefits users in finding items of their interest.
Helps item providers in delivering their items to the right user.
Identifies products that are most relevant to users.
Personalized content.
Helps websites to improve user engagement.
It does not suffer from the cold start problem, which means that even on day 1 of the
business it can recommend products using various filters.
There is no need for the user's historical data.
Demerits of popularity based recommendation system
Not personalized.
The system recommends the same sort of products/movies, based solely on
popularity, to every user.
Example
Google News: news filtered by trending and most popular stories.
YouTube: trending videos.
2. Classification Model
This model uses features of both products and users to predict whether a
user will like a product or not.
Classification model
The output can be either 0 or 1: 1 if the user likes it, and 0 otherwise.
It is a rigorous task to collect a high volume of information about different users and
products.
To check the similarity between products, or mobile phones in this example,
the system computes distances between them. The OnePlus 7 and OnePlus 7T both
have 8 GB RAM and a 48 MP primary camera.
If the similarity between the two products is to be checked, the Euclidean distance is
calculated. Here, distance is calculated based on RAM and camera.
Cosine Similarity: The cosine of the angle between the two item vectors, A and B, is
calculated for imputing similarity. The closer the vectors, the smaller the angle and
the larger the cosine.
Cosine Similarity
Jaccard Similarity: The number of users who have rated both items A and B divided
by the total number of users who have rated either A or B gives us the similarity. It is
used for comparing sets.
Jaccard Similarity
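Both similarity measures are short functions; a minimal sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two item vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard_similarity(raters_a, raters_b):
    # users who rated both items / users who rated either item
    a, b = set(raters_a), set(raters_b)
    return len(a & b) / len(a | b)

# parallel vectors have a zero angle, so cosine similarity is 1
cos = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
# users {2, 3} rated both, users {1, 2, 3, 4} rated either -> 2/4
jac = jaccard_similarity([1, 2, 3], [2, 3, 4])
```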
Merits
Demerits
4. Collaborative Filtering
It is considered one of the smartest recommender system approaches; it works
on the similarity between different users and also between items, and is widely used
on e-commerce websites and online movie platforms. It checks the tastes of
similar users and makes recommendations accordingly.
The similarity is not restricted to the tastes of users; similarity between
different items can also be considered. The system gives more
efficient recommendations when we have a large volume of information about users
and items.
Figure 2 shows two different users and their interests, along with the
similarity between the tastes of both users. It is found that both Jill and Megan
have similar tastes, so Jill's interest is recommended to Megan and vice versa.
This is the way collaborative filtering works. Mainly, there are two
approaches used in collaborative filtering, stated below:
a) User-based nearest-neighbor collaborative filtering
Limitations
Enough users are required to find a match. To overcome such cold-start problems,
hybrid approaches that combine CF with content-based matching are often used.
Even when there are many users and many items to recommend, the user-rating
matrix tends to be sparse, and it becomes challenging to find users who have
rated the same items. This sparsity also makes it hard to recommend items to a
given user.
For the recommendation part, the component taken care of is matrix factorization,
which is performed on the user-item rating matrix. Matrix factorization means
finding two matrices whose product is the original matrix. Vectors are used to
represent item ‘qi’ and user ‘pu’ such that their dot product is the expected
rating.
Bias Term
The minimized equation is,
Minimizing with Stochastic Gradient Descent (SGD): SGD is used to minimize the
above equation. It initializes the parameters of the equation we are trying to
minimize and then iterates, reducing the error between the actual value and the
predicted value by correcting the parameters with a small factor each time.
SGD uses the learning rate to control how far the parameters move from their
previous values at each iteration.
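The SGD update described above can be sketched as follows; the toy rating matrix, learning rate, and regularization factor here are illustrative assumptions, not values from the text:

```python
import random

# Toy user-item rating matrix; 0 marks a missing rating (invented data).
R = [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [1, 1, 0, 5],
     [0, 1, 5, 4]]
n_users, n_items, k = len(R), len(R[0]), 2

random.seed(0)
P = [[random.random() for _ in range(k)] for _ in range(n_users)]  # user vectors p_u
Q = [[random.random() for _ in range(k)] for _ in range(n_items)]  # item vectors q_i

alpha, lam = 0.01, 0.02  # learning rate and regularization factor
for _ in range(5000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i] == 0:
                continue  # skip unobserved ratings
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = R[u][i] - pred  # error between actual and predicted value
            for f in range(k):  # correct both vectors by a small factor
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += alpha * (err * qi - lam * pu)
                Q[i][f] += alpha * (err * pu - lam * qi)

pred_00 = sum(P[0][f] * Q[0][f] for f in range(k))
print(round(pred_00, 1))  # should be close to the observed rating R[0][0] = 5
```

After training, the dot product pu · qi reproduces the observed ratings, and the same dot product fills in the missing entries as predictions.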
The EM (Expectation-Maximization) algorithm is a latent-variable method for
finding the local maximum-likelihood parameters of a statistical model; it was
proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977. It is one of
the most commonly used techniques in machine learning for obtaining maximum-
likelihood estimates of variables that are sometimes observable and sometimes
not. It is also applicable to unobserved data, sometimes called latent data.
What is an EM algorithm?
Key Points:
EM Algorithm
o Maximization step (M-step): This step uses the estimated data from the E-step
to update the parameters.
o Repeat the E-step and M-step until the values converge.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to
update the values of the parameters in the M-step.
Steps in EM Algorithm
o 1st Step: The very first step is to initialize the parameter values. The
system is then provided with incomplete observed data, with the assumption that
the data is obtained from a specific model.
The Gaussian Mixture Model (GMM) is a mixture model that combines several
probability distributions whose parameters are unspecified. Fitting a GMM
therefore requires estimating statistics such as the mean and standard deviation
of each component, i.e., its parameters. It is used to estimate the parameters
of the probability distributions that best fit the density of a given training
dataset. Although plenty of techniques are available to estimate the parameters
of a Gaussian Mixture Model (GMM), Maximum Likelihood Estimation is one of the
most popular among them.
Let's consider a case where we have a dataset with multiple data points generated
by two different processes, where each process produces a similar Gaussian
distribution.
The process used to generate each data point represents a latent variable, or
unobservable data. In such cases, the Expectation-Maximization algorithm is one
of the best techniques for estimating the parameters of the Gaussian
distributions. In the EM algorithm, the E-step estimates the expected value of
each latent variable, whereas the M-step optimizes the parameters using Maximum
Likelihood Estimation (MLE). This process is repeated until a good set of latent
values and a maximum likelihood that fits the data are achieved.
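A minimal sketch of this E-step/M-step loop for a two-component, one-dimensional GMM; the data-generating means (0 and 5) and all starting values are assumptions made purely for illustration:

```python
import math
import random

random.seed(1)
# Synthetic 1-D data from two Gaussian processes; which process generated
# each point is the (unobserved) latent variable.
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(5.0, 1.0) for _ in range(200)])

def pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Initial parameter guesses for the two components
mu = [-1.0, 6.0]
sigma = [1.0, 1.0]
weight = [0.5, 0.5]

for _ in range(50):  # repeat E-step and M-step until convergence
    # E-step: expected responsibility of each component for each point
    resp = []
    for x in data:
        p = [weight[j] * pdf(x, mu[j], sigma[j]) for j in range(2)]
        s = sum(p)
        resp.append([pj / s for pj in p])
    # M-step: re-estimate parameters by maximum likelihood given responsibilities
    for j in range(2):
        nj = sum(r[j] for r in resp)
        mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
        sigma[j] = math.sqrt(sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj)
        weight[j] = nj / len(data)

print([round(m, 1) for m in mu])  # means recovered near the true 0.0 and 5.0
```

Even though no point is labeled with its generating process, alternating the two steps recovers the component means, spreads, and mixing weights.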
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data of the
latent variables from the observed data in a dataset. The EM algorithm, or
latent variable model, has a broad range of real-life applications in machine
learning, including the following:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the value of the parameter in mixed models such as
the Gaussian Mixture Model and quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent
abilities of item response theory models.
o It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
o It is used to determine the Gaussian density of a function.
Advantages of EM algorithm
o The two basic steps of the EM algorithm, the E-step and the M-step, are very
easy to implement for many machine learning problems.
o The likelihood is guaranteed not to decrease after each iteration.
Disadvantages of EM algorithm
Conclusion
Reinforcement Learning
o It is a core part of Artificial Intelligence, and all AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program the agent,
as it learns from its own experience without any human intervention.
o Example: Suppose an AI agent is present within a maze environment, and its
goal is to find the diamond. The agent interacts with the environment
by performing some actions, and based on those actions, the state of the agent
changes, and it also receives a reward or penalty as feedback.
o The agent continues doing these three things (taking an action, changing
state or remaining in the same state, and getting feedback), and by doing so it
learns and explores the environment.
o The agent learns which actions lead to positive feedback or rewards and which
actions lead to negative feedback or penalties. As a positive reward, the agent
gets a positive point, and as a penalty, it gets a negative point.
o Agent(): An entity that can perceive/explore the environment and act upon it.
o Environment(): The situation in which the agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is random
in nature.
o Action(): Actions are the moves taken by an agent within the environment.
o State(): A state is the situation returned by the environment after each
action taken by the agent.
o In RL, the agent is not instructed about the environment and what actions need
to be taken.
There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach tries to find the optimal value function, i.e., the
maximum value achievable at a state under any policy. The agent then expects
the long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach finds the optimal policy for the maximum future
rewards without using a value function. In this approach, the agent tries to
apply a policy such that the action performed at each step helps to maximize
the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any
state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
particular solution or algorithm for this approach because the model
representation is different for each environment.
There are four main elements of Reinforcement Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model
1) Policy: A policy defines the way an agent behaves at a given time. It maps
the perceived states of the environment to the actions to be taken in those
states. A policy is the core element of RL, as it alone can define the behavior
of the agent. In some cases, it may be a simple function or a lookup table,
whereas in other cases, it may involve more general computation, such as a
search process.
3) Value Function: The value function gives information about how good a
situation and an action are, and how much reward an agent can expect. A reward
is the immediate signal for each good or bad action, whereas a value function
specifies what is good in the long run. The value function depends on the
reward: without reward, there could be no value. The goal of estimating values
is to achieve more rewards.
4) Model: The last element of reinforcement learning is the model, which mimics
the behavior of the environment. With the help of the model, one can make
inferences about how the environment will behave. Such as, if a state and an action
are given, then a model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course
of action by considering all future situations before actually experiencing
them. Approaches that solve RL problems with the help of a model are termed
model-based approaches.
Control is just another term for action in RL. An action is often written
as a or u, with states written as s or x. A controller uses a model (the system
dynamics) to choose the controls along an optimal trajectory, which is expressed
as a sequence of states and controls.
In model-based RL, we optimize the trajectory for the least cost instead of the
maximum reward.
As mentioned before, model-free RL ignores the model and cares little about the
inner workings; we fall back on sampling to estimate rewards.
With a cost function, we find the optimal trajectory as the one with the lowest
cost.
Known models
In many games, like Go, the rules of the game are the model.
AlphaGo
In other cases, the model can be the laws of physics. Sometimes we know how to
model the system and can build simulators for it.
We can define this model with rules or equations. Or we can learn it, for
example using a Gaussian Process, a Gaussian Mixture Model (GMM), or deep
networks. To fit these models, we run a controller to collect sample
trajectories and train the models with supervised learning.
Motivation
In model-based RL, the model may be known or learned. In the latter case, we run
a base policy, such as a random or an educated policy, and observe the
trajectory. Then we fit a model using this sampled data.
In step 2 above, we use supervised learning to train a model that minimizes the
least-squares error on the sampled trajectory. In step 3, we use a trajectory
optimization method, like iLQR, to calculate the optimal trajectory using the
model and a cost function that measures, say, how far we are from the target
location and how much effort is spent.
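Steps 1 and 2 can be sketched on an invented one-dimensional system; the true dynamics s' = 0.9·s + 0.5·a and the noise level are assumptions made only so the fit has something to recover:

```python
import random

random.seed(0)

# Hypothetical true (unknown-to-the-learner) dynamics: s' = 0.9*s + 0.5*a + noise
def true_step(s, a):
    return 0.9 * s + 0.5 * a + random.gauss(0, 0.01)

# Step 1: run a base (random) policy and record (s, a, s') transitions.
samples = []
s = 0.0
for _ in range(500):
    a = random.uniform(-1, 1)
    s_next = true_step(s, a)
    samples.append((s, a, s_next))
    s = s_next

# Step 2: fit the linear model s' ~ theta_s*s + theta_a*a by least squares,
# solving the 2x2 normal equations (X^T X) theta = X^T y by hand.
Sss = sum(si * si for si, ai, yi in samples)
Saa = sum(ai * ai for si, ai, yi in samples)
Ssa = sum(si * ai for si, ai, yi in samples)
Ssy = sum(si * yi for si, ai, yi in samples)
Say = sum(ai * yi for si, ai, yi in samples)
det = Sss * Saa - Ssa * Ssa
theta_s = (Ssy * Saa - Say * Ssa) / det
theta_a = (Say * Sss - Ssy * Ssa) / det

print(round(theta_s, 2), round(theta_a, 2))  # recovers roughly 0.9 and 0.5
```

The fitted coefficients can then be handed to a trajectory optimizer (step 3) in place of the unknown dynamics.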
Learn the model iteratively
land in areas where the model has not been learned yet. Without a proper model
around these areas, we cannot plan the optimal controls.
So we repeat step 2 and step 4, continuing to collect samples and fit the model
around the searched space.
Nevertheless, this method executes all the planned actions before fitting the
model again; by then, we may already be too far off course.
In MPC, we optimize the whole trajectory, but we take only the first action. We
then observe the new state and replan. Replanning gives us a chance to take
corrective action after observing the current state again. For a stochastic
model, this is particularly helpful.
The controls produced by a controller are calculated from a model and a cost
function using trajectory optimization methods like iLQR.
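This plan-act-replan loop can be sketched with random-shooting optimization standing in for iLQR; the one-dimensional model s' = s + a, the target, and the cost weights are all invented for illustration:

```python
import random

random.seed(0)

# Assumed known model and a cost that measures distance from the target.
def model(s, a):
    return s + a

target = 10.0

def cost(states, actions):
    return sum((s - target) ** 2 + 0.1 * a * a for s, a in zip(states, actions))

def plan(s0, horizon=5, n_candidates=200):
    """Random-shooting trajectory optimization: sample action sequences, keep the cheapest."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [random.uniform(-1, 1) for _ in range(horizon)]
        s, states = s0, []
        for a in seq:
            s = model(s, a)
            states.append(s)
        c = cost(states, seq)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq

s = 0.0
for t in range(30):
    seq = plan(s)         # optimize the whole trajectory ...
    s = model(s, seq[0])  # ... but execute only the first action, then replan
print(round(s, 1))        # state driven toward the target of 10.0
```

Replanning at every step means early planning errors never accumulate: each new plan starts from the state actually observed.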
However, we can also model a policy π directly using a deep network or a
Gaussian Process. For example, we can use the model to predict the next state
given an action, use the policy to decide the next action, and use the state
and action to compute the cost. Finally, we backpropagate the cost to train the
policy.
These methods aim, for some policy π, to provide and update an estimate V of the
policy's value vπ for all states or state-action pairs, updating the estimate as
the agent experiences them.
1. Gamma (γ): the discount rate, a value between 0 and 1. The higher the value,
the less you are discounting.
2. Lambda (λ): the credit-assignment variable, a value between 0 and 1. The
higher the value, the more credit you can assign to states and actions further
back.
3. Alpha (α): the learning rate, a value between 0 and 1: how much of the error
we accept and therefore adjust our estimates towards. A higher value adjusts
aggressively, accepting more of the error, while a smaller value adjusts
cautiously, making more conservative moves towards the actual values.
The most basic method for TD learning is the TD(0) method. Temporal-Difference
TD(0) learning updates the estimated value of a state V for a policy based on
the reward the agent received and the value of the state it transitioned to.
Specifically, if our agent is in state st, takes action at, and receives reward
rt+1, then we update our estimate of V following
V(st) ← V(st) + α[rt+1 + γV(st+1) – V(st)].
Temporal-Difference learning estimates the value function vπ(st), a measure of
how much the agent wants to be in a certain state, which we repeatedly improve
via the reward outcome and the current estimate of vπ(st+1). This way, the
estimate of the current state relies on the estimates of all future states, so
information slowly trickles down over many runs through the chain.
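The update above can be sketched with TD(0) estimating state values for a random walk on a five-state chain; the environment, step size, and episode count are invented for illustration:

```python
import random

random.seed(0)
# Five-state chain; states 0 and 4 are terminal. A random policy moves
# left or right; reward +1 only when the agent lands on state 4.
n_states, alpha, gamma = 5, 0.05, 1.0
V = [0.0] * n_states  # value estimates for the random policy

for _ in range(5000):
    s = 2                                   # every episode starts in the middle
    while s not in (0, 4):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 4 else 0.0
        # TD(0): V(st) <- V(st) + alpha * [rt+1 + gamma*V(st+1) - V(st)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V])  # interior values approach 0.25, 0.5, 0.75
```

The interior estimates converge toward the true probabilities of finishing on the right, illustrating how value information trickles back through the chain one transition at a time.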
Q(st,at)←Q(st,at)+α[rt+1+γmaxaQ(st+1,a)–Q(st,at)].
Here our Q-values are updated by comparing the current Q-value to the reward
gained plus the maximal greedy option available to the agent in the next
state st+1 (a figure similar to the one for TD(0) applies), and hence we can
estimate the action-value function Q(s,a) directly. This estimate is independent
of the policy currently being followed; the policy the agent follows only
affects which states are visited when it selects an action in the new state and
moves on. Q-learning performs updates only as a function of the seemingly
optimal actions, regardless of which action will actually be chosen.
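A sketch of tabular Q-learning on an invented one-dimensional gridworld; the behavior policy here is uniformly random, which works precisely because the update is off-policy:

```python
import random

random.seed(0)
# Invented gridworld: states 0..5, move left (-1) or right (+1);
# reaching state 5 gives reward +1 and ends the episode.
actions = (-1, 1)
alpha, gamma = 0.1, 0.9
Q = {(s, a): 0.0 for s in range(6) for a in actions}

def step(s, a):
    s2 = min(max(s + a, 0), 5)
    return s2, (1.0 if s2 == 5 else 0.0), s2 == 5

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.choice(actions)  # any exploratory behavior policy works
        s2, r, done = step(s, a)
        # Q-learning: bootstrap from the greedy max over next actions,
        # regardless of which action the behavior policy will take.
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print(round(Q[(4, 1)], 2))  # near 1.0: stepping right from state 4 is immediately rewarded
```

Even though the agent behaves entirely at random, Q converges to the values of the greedy policy, which is what "off-policy" means in practice.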
episode, whatever states or actions that might include. For the SARSA method,
we make the update
Q(st,at)←Q(st,at)+α[rt+1+γQ(st+1,at+1)–Q(st,at)].
The SARSA algorithm has one conceptual subtlety: when updating, we imply that we
know in advance the next action at+1 for any possible next state. This requires
stepping forward and computing the policy's next action when updating, and
learning is therefore highly dependent on the current policy the agent is
following. This complicates the exploration process, so it is common to use
some form of ϵ-soft policy for on-policy methods.
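For contrast with Q-learning, here is a SARSA sketch on a similarly invented gridworld; note that the next action at+1 is drawn from the ϵ-soft policy before the update, which is what makes the method on-policy:

```python
import random

random.seed(0)
# Invented gridworld: states 0..5, reward +1 on reaching state 5 (terminal).
actions = (-1, 1)
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(6) for a in actions}

def policy(s):
    """epsilon-soft: mostly greedy (random tie-breaking), sometimes random."""
    if random.random() < eps:
        return random.choice(actions)
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])

def step(s, a):
    s2 = min(max(s + a, 0), 5)
    return s2, (1.0 if s2 == 5 else 0.0), s2 == 5

for _ in range(2000):
    s, a, done = 0, policy(0), False
    while not done:
        s2, r, done = step(s, a)
        a2 = policy(s2)  # look ahead: the action the policy will actually take
        # SARSA: bootstrap from Q of that actual next action, not the max
        target = r + (0.0 if done else gamma * Q[(s2, a2)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
```

Because the update uses Q(st+1, at+1) for the action really chosen, the learned values reflect the ϵ-soft policy itself, including its occasional exploratory moves.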