
UNIT-V

• Clustering: Introduction, Similarity and Distance Measures, Outliers,
Hierarchical Methods, Partitional Algorithms, Clustering Large
Databases, Clustering with Categorical Attributes, Comparison.
• Dimensionality Reduction: Linear Discriminant Analysis, Principal
Component Analysis
• Interactive Learning: Active Learning, Common heuristics, Sampling
bias, Safe Disagreement Based Active Learning Schemes
• Semi-Supervised Learning: Semi-supervised Learning, Transductive
SVM, Co-training
• Reinforcement Learning: Markov Decision Processes, Value
Iteration, Q-Learning.
CLUSTERING
INTRODUCTION: Clustering is a technique in machine learning that involves grouping similar objects or data points into clusters based on some similarity or distance measure.

There are various similarity and distance measures that can be used in clustering; some of the most commonly used ones are:

Euclidean distance: This is the most common distance measure used in clustering. It calculates the straight-line distance between two data points in an n-dimensional space. This measure is appropriate when the data points are continuous and the features have the same unit of measurement. The Euclidean distance between two n-dimensional data points x and y is given by:
d(x, y) = √((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)
where xi and yi are the values of the ith feature of x and y, respectively.

Manhattan distance: Also known as the "city block distance", this measure calculates the sum of the absolute differences between the features of two data points. This measure is useful when the data is in a grid-like structure, such as images or texts. The Manhattan distance (L1 distance) between two n-dimensional data points x and y is given by:
d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
where xi and yi are the values of the ith feature of x and y, respectively.

Cosine similarity: This measure calculates the cosine of the angle between two non-zero data points (vectors) in a multi-dimensional space. It is commonly used in text and image processing, where the data is represented as vectors. The formula for cosine similarity between two n-dimensional vectors x and y is:
similarity(x, y) = (x · y) / (||x|| ||y||)
where ||x|| and ||y|| are the norms of x and y, respectively, and (x · y) is the dot product of x and y.
The dot product of two vectors x and y is calculated as the sum of the products of their corresponding components:
x · y = x1y1 + x2y2 + ... + xnyn
The norm of a vector x is calculated as the square root of the sum of the squares of its components:
||x|| = sqrt(x1² + x2² + ... + xn²)
The resulting value of cosine similarity is a number between -1 and 1, where values close to 1 indicate high similarity, values close to -1 indicate high dissimilarity, and a value of 0 indicates that the vectors are orthogonal or have no similarity.

Pearson correlation coefficient: This measure is commonly used in clustering of numerical data. It calculates the correlation between the features of two data points, taking into account the means and standard deviations of the features. The Pearson correlation coefficient between two n-dimensional data points x and y is given by:
corr(x, y) = (x - μx) · (y - μy) / (σx σy)
where μx and μy are the means of x and y, and σx and σy are the standard deviations of x and y, respectively.

Jaccard similarity: This measure is used in clustering of binary or categorical data. It calculates the similarity between two data points based on the number of common features they share. The Jaccard similarity between two binary or categorical data points x and y is given by:
sim(x, y) = |x ∩ y| / |x ∪ y|
where x ∩ y is the intersection of x and y, and x ∪ y is the union of x and y.

These formulae can be used to compute the similarity or distance between any two data points in a dataset, which can then be used to group similar data points together into clusters. The choice of similarity or distance measure depends on the type of data being clustered and the problem at hand. It is important to choose a measure that is appropriate for the data and the problem to obtain meaningful results.
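The measures above translate directly into code. The following is a minimal NumPy sketch; the example vectors are assumed purely for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])   # assumed example vectors
y = np.array([2.0, 4.0, 1.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))              # straight-line (L2) distance
manhattan = np.sum(np.abs(x - y))                       # city-block (L1) distance
cosine_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
pearson = np.corrcoef(x, y)[0, 1]                       # Pearson correlation coefficient

# Jaccard similarity for binary attribute vectors
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 1, 0, 0], dtype=bool)
jaccard = np.sum(a & b) / np.sum(a | b)

print(euclidean, manhattan, cosine_sim, pearson, jaccard)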
Hierarchical clustering is a method of clustering in which data points are grouped together based on their similarity or dissimilarity. There are two main types of hierarchical clustering methods: agglomerative and divisive.

Agglomerative hierarchical clustering starts with each data point as a separate cluster and then iteratively merges the two closest clusters into a new, larger cluster until all data points are in a single cluster. This process can be visualized as a dendrogram, a tree-like structure that shows the order in which the clusters were merged.

Divisive hierarchical clustering, on the other hand, starts with all data points in a single cluster and then iteratively splits the cluster into two smaller clusters based on their dissimilarity, until each data point is in a separate cluster.

Both agglomerative and divisive hierarchical clustering methods require a measure of similarity or dissimilarity between data points, as well as a method for determining the distance between clusters. The most common distance measures used in hierarchical clustering are the Euclidean distance and the Manhattan distance.

There are several methods for determining the distance between clusters, including single linkage, complete linkage, and average linkage. Single linkage uses the distance between the closest points in two clusters, while complete linkage uses the distance between the farthest points in two clusters. Average linkage computes the average distance between all pairs of points in two clusters.

Hierarchical clustering is a useful method for exploratory data analysis and can help to identify natural groupings or patterns in the data. However, it can be computationally expensive, particularly for large datasets. Additionally, the choice of distance measure and clustering method can have a significant impact on the results, and it can be difficult to interpret the dendrogram when there are many clusters.
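As a brief illustration of agglomerative clustering and the dendrogram, here is a minimal sketch using SciPy; the toy dataset, the choice of average linkage, and the cut into two flat clusters are assumptions made only for the example:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy 2-D dataset (assumed for illustration)
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])

# Agglomerative clustering: iteratively merge the two closest clusters
Z = linkage(X, method="average", metric="euclidean")

# Cut the merge tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree of merges (requires matplotlib)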
Partitioning algorithms are a class of algorithms used in computer science and data analysis to divide a set of data points into distinct groups, or partitions, based on certain criteria. These algorithms are commonly used in clustering and classification tasks, and they aim to group data points that are similar to each other while keeping different groups as distinct as possible.

Here are some common types of partitioning algorithms:

K-means algorithm: This is a popular clustering algorithm that partitions data points into K clusters based on their similarity to each other. It is an iterative algorithm that minimizes the sum of squared distances between data points and their assigned cluster centroids.

Hierarchical clustering: This algorithm builds a tree-like structure of clusters by recursively merging or dividing data points based on their similarity. The result is a hierarchical representation of the data, which can be cut at different levels to obtain different numbers of clusters.

Density-based clustering: This type of algorithm identifies clusters based on regions of high data density, rather than on explicit boundaries between clusters. Points within a dense region are considered to be part of the same cluster, while points in low-density regions are considered outliers or noise.

Spectral clustering: This algorithm uses the eigenvectors of a similarity matrix to partition data points into clusters. It is particularly useful when dealing with non-linearly separable data, as it can capture the underlying manifold structure of the data.

Partitioning Around Medoids (PAM): The PAM algorithm is similar to K-means, but instead of using cluster centers (means), it uses actual data points (medoids) as representatives of the clusters. This makes PAM more robust to noise and outliers.

Overall, partitioning algorithms are powerful tools for data analysis and can help reveal underlying patterns in large datasets. However, the choice of algorithm depends on the type of data and the specific analysis goals.
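To make the K-means description above concrete, here is a minimal scikit-learn sketch; the synthetic dataset and the value K = 3 are assumed for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with three loose groups (assumed)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# K-means alternates between assigning points to the nearest centroid and
# recomputing centroids, minimizing the within-cluster sum of squared distances
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # learned centroids
print(kmeans.inertia_)           # sum of squared distances to assigned centroids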
CLUSTERING LARGE DATABASES is a common task in data analysis and data mining. The goal is to group similar data points together into clusters, based on some similarity metric, so that it becomes easier to analyze and understand the data.

Here are some common techniques and considerations for clustering large databases:

Preprocessing: Before clustering, it is often necessary to preprocess the data to remove noise, handle missing values, and normalize the data.

Choosing a clustering algorithm: There are many clustering algorithms available, including K-means, hierarchical clustering, density-based clustering, and more. The choice of algorithm depends on the specific characteristics of the data and the goals of the analysis.

Choosing a distance metric: A distance metric measures the similarity between two data points. The choice of metric depends on the type of data being analyzed and the characteristics of the clustering algorithm.

Parallel processing: Clustering large databases can be computationally intensive, so it may be necessary to use parallel processing techniques to speed up the analysis.

Evaluation: It is important to evaluate the quality of the clustering results to ensure that the clusters are meaningful and useful. This can be done by examining the clustering results visually, or by using metrics such as the silhouette score or the Rand index.

Iteration: Clustering can be an iterative process, where the results of one clustering analysis are used to inform the next iteration. This can help to refine the clusters and improve the overall quality of the analysis.

Scalability: As the size of the database grows, it may become necessary to use distributed computing techniques, such as MapReduce or Spark, to cluster the data efficiently.

Overall, clustering large databases requires careful consideration of the data, the clustering algorithm, and the computational resources available. With the right tools and techniques, it is possible to gain valuable insights from even the largest and most complex datasets.
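One practical way to apply these ideas is a mini-batch variant of K-means, which scales to large datasets by updating centroids from small random batches. The sketch below is illustrative only; the synthetic data, batch size, and evaluation subsample are assumptions:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

# Large synthetic dataset (assumed; in practice the data might be streamed in chunks)
rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 10))

# Mini-batch K-means updates centroids from small random batches,
# keeping memory and compute costs manageable on large databases
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=42).fit(X)

# Evaluate on a subsample, since the silhouette score is quadratic in the number of points
sample = rng.choice(len(X), size=5_000, replace=False)
print(silhouette_score(X[sample], mbk.labels_[sample]))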
Clustering with Categorical Attributes:
Clustering with categorical attributes can be challenging because most clustering algorithms are designed to work with numerical data. However, there are several techniques that can be used to cluster categorical data effectively:

Convert categorical data to numerical data: One approach is to convert categorical data to numerical data, using techniques such as one-hot encoding, binary encoding, or ordinal encoding. These techniques represent categorical data as a set of binary or numerical features that can be used with traditional clustering algorithms.

Similarity measures for categorical data: Another approach is to use similarity measures that are specifically designed for categorical data. There are several similarity measures that can be used, including Jaccard similarity, Dice similarity, and Overlap similarity.

Hierarchical clustering: Hierarchical clustering can be used to cluster categorical data by computing a distance or similarity matrix between pairs of data points, using a similarity measure that is appropriate for categorical data.

Partitioning-based clustering: Partitioning-based clustering algorithms, such as k-modes and k-prototypes, are specifically designed for clustering categorical data. These algorithms use a dissimilarity measure that is based on the mode or prototype of each cluster.

Ensemble-based clustering: Ensemble-based clustering algorithms, such as Co-Association Matrix Clustering (COCA), can be used to cluster categorical data by combining multiple clustering algorithms or similarity measures.

Visualization: Visualization techniques, such as parallel coordinates plots and t-SNE, can be used to visualize the categorical data in a low-dimensional space, making it easier to identify clusters and patterns.

Overall, clustering with categorical attributes requires careful consideration of the data and the clustering algorithm or technique used. The choice of technique depends on the characteristics of the data and the goals of the analysis. With the right tools and techniques, it is possible to gain valuable insights from categorical data.
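As a small illustration of the first approach (encode, then use a standard algorithm), the sketch below one-hot encodes a few toy categorical records and clusters them with K-means; the records, K = 2, and the use of scikit-learn are assumptions made for the example:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

# Toy categorical records: (color, size) -- assumed for illustration
records = np.array([
    ["red", "small"], ["red", "small"], ["blue", "large"],
    ["blue", "large"], ["green", "small"],
])

# One-hot encode the categorical attributes into binary indicator features
X = OneHotEncoder().fit_transform(records).toarray()

# Any traditional clustering algorithm can now be applied to the encoded data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)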
COMPARISON: When clustering data with categorical attributes, there are several techniques that can be used, as described in the previous section. Here is a comparison of some of the most common techniques:

Convert categorical data to numerical data: This technique works well with clustering algorithms that are designed to work with numerical data, such as k-means and hierarchical clustering. However, the technique can create high-dimensional data, which can cause some algorithms to perform poorly or require more computational resources.

Similarity measures for categorical data: Similarity measures that are designed for categorical data can perform well, especially with hierarchical clustering or partition-based clustering algorithms. However, the choice of similarity measure can have a significant impact on the clustering results, and some measures can be more appropriate than others depending on the nature of the data.

Hierarchical clustering: Hierarchical clustering can be used with both numerical and categorical data, but it can be more effective with categorical data if a similarity measure that is appropriate for categorical data is used. Hierarchical clustering is also useful for identifying clusters at different levels of granularity.

Partitioning-based clustering: Partitioning-based clustering algorithms, such as k-modes and k-prototypes, are specifically designed for clustering categorical data. These algorithms can be very effective, especially when the data has a large number of categorical variables. However, they can be more computationally intensive and require more resources than other clustering techniques.

Ensemble-based clustering: Ensemble-based clustering algorithms can be effective when different clustering techniques or similarity measures are combined. These algorithms can be more robust to the choice of clustering technique or similarity measure, but they can also be more complex to implement.

Visualization: Visualization techniques can be used to explore the data and identify clusters. However, visualization techniques can be subjective and can be affected by the choice of visualization technique or the dimensionality reduction technique used.
Overall, the choice of clustering technique depends on the nature of the data and the goals of the analysis. It may be useful to experiment with different techniques and compare the results to determine which approach is most effective for the particular dataset and clustering problem.

DIMENSIONALITY REDUCTION
Dimensionality reduction is the process of reducing the number of variables or features in a dataset while retaining as much of the important information as possible. This is often done to simplify the data and make it easier to analyze, visualize, and interpret. Dimensionality reduction can be particularly useful when working with high-dimensional data, where the number of features can be very large.

There are two main approaches to dimensionality reduction: feature selection and feature extraction.

Feature selection: Feature selection involves selecting a subset of the original features and discarding the rest. This approach can be based on various criteria such as variance, correlation, or mutual information. The main advantage of feature selection is that it can be simple and efficient, but it may result in loss of important information that is not captured by the selected features.

Feature extraction: Feature extraction involves creating new features that are combinations of the original features. This can be done through techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or t-distributed Stochastic Neighbor Embedding (t-SNE). Feature extraction can be more effective than feature selection in preserving important information while still reducing the dimensionality of the data. However, it can also be more complex and computationally intensive.

Some common techniques for dimensionality reduction include:

Principal Component Analysis (PCA): PCA is a widely used technique that involves finding the linear combinations of the original features that explain the most variance in the data. The resulting components are then used as the new features.
Linear Discriminant Analysis (LDA): LDA is a supervised technique that seeks to find a linear combination of the original features that maximizes the separation between classes. This approach is useful for classification problems.

t-distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data in a low-dimensional space.

Overall, dimensionality reduction is a powerful tool for simplifying and analyzing complex data. The choice of technique depends on the nature of the data, the goals of the analysis, and the computational resources available.

Linear Discriminant Analysis (LDA) is a supervised learning algorithm that is used for dimensionality reduction and classification. LDA is used when we have labeled data and want to find the best linear combination of the features that separates the classes in the data. The main idea behind LDA is to find a linear transformation of the features that maximizes the separation between classes. This transformation is done in such a way that the new features have the highest possible between-class variance and the lowest possible within-class variance.

The basic steps involved in LDA are as follows:

Standardize the data: LDA assumes that the features are normally distributed, so we need to standardize the data to have zero mean and unit variance.

Compute the class means and overall mean: Compute the mean of each feature for each class and the mean of each feature overall.

Compute the between-class and within-class scatter matrices: Compute the scatter matrix for each class and the scatter matrix for the overall data.

Compute the eigenvalues and eigenvectors of the generalized eigenvalue problem: The eigenvectors are used as the new features.

Select the top k eigenvectors: Select the top k eigenvectors that correspond to the k largest eigenvalues. These eigenvectors are the new features that are used for classification.

LDA can be used for both binary and multi-class classification problems. In the binary case, LDA finds a line that separates the two classes.
In the multi-class case, LDA finds a hyperplane that separates the classes.

LDA has several advantages, including its simplicity and effectiveness for classification problems with well-separated classes. However, LDA assumes that the classes have equal covariance matrices, which may not be true in practice. LDA also assumes that the data is normally distributed, which may not be the case for all datasets.

The goal of LDA is to find a linear combination of features that maximally separates different classes in the data. Here is the mathematical formulation of LDA for the two-class case:

Assuming that we have two classes in our data, labeled as class 0 and class 1, and the input features are represented by a d-dimensional vector x, LDA tries to find a projection vector w such that the projected data w'x is well separated by class.

The mean of each class, denoted as μ0 and μ1, respectively, can be computed as follows:
μ0 = (1/n0) * Σ xi for all i such that yi = 0
μ1 = (1/n1) * Σ xi for all i such that yi = 1
where xi is the ith data point and yi is the label of that data point (either 0 or 1). n0 and n1 are the number of data points in class 0 and class 1, respectively.

The scatter matrix within each class (the within-class scatter matrix), denoted as Sw, can be computed as follows:
Sw = Σi (xi - μyi)(xi - μyi)' for all i such that yi = 0 or yi = 1
where μyi is the mean of the class to which xi belongs.
The scatter matrix between classes (the between-class scatter matrix), denoted as Sb, can be computed as follows:
Sb = (μ1 - μ0)(μ1 - μ0)'

The LDA projection vector w can be obtained by solving the following optimization problem:
maximize J(w) = (w'Sb w) / (w'Sw w)
subject to ||w|| = 1, where ||w|| represents the Euclidean norm of w.

The optimal projection vector w can be found using the eigenvalue decomposition of the matrix Sw⁻¹Sb. The direction of the eigenvector corresponding to the largest eigenvalue is the optimal projection direction.

Finally, the projected data can be obtained by computing the dot product of the data points and the projection vector:
y = w'x
The classification rule can then be based on the sign of y: if y is positive, the data point belongs to class 1, otherwise it belongs to class 0.
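The two-class procedure above can be written directly in NumPy. The sketch below follows the scatter-matrix formulation; the toy Gaussian data and the midpoint threshold (used instead of a plain sign rule, since this toy data is not centered) are assumptions made for the example:

import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data in 2-D (assumed for illustration)
X0 = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(100, 2))   # class 0
X1 = rng.normal(loc=(3.0, 2.0), scale=1.0, size=(100, 2))   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: sum of the scatter matrices of the two classes
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Between-class scatter for two classes: (mu1 - mu0)(mu1 - mu0)'
diff = (mu1 - mu0).reshape(-1, 1)
Sb = diff @ diff.T

# For two classes the optimal direction reduces to Sw^-1 (mu1 - mu0)
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)

# Project the data and classify against the midpoint of the projected class means
threshold = (mu0 @ w + mu1 @ w) / 2.0
X_all = np.vstack([X0, X1])
y_pred = (X_all @ w > threshold).astype(int)
print(y_pred[:5], y_pred[-5:])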
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset. It is a method of transforming the data to a new coordinate system in such a way that the first coordinate (or principal component) captures the maximum amount of variation in the data, the second coordinate captures the maximum amount of remaining variation, and so on.

PCA works by identifying the linear combinations of the original variables that account for the most variance in the data. These linear combinations are known as principal components. The first principal component is the linear combination that accounts for the most variance in the data, the second principal component is the linear combination that accounts for the most remaining variance, and so on.

PCA can be used for a variety of purposes, including data visualization, feature extraction, and noise reduction. It is often used in machine learning to preprocess data before applying other algorithms. By reducing the dimensionality of the data, PCA can also help to reduce the computational complexity of subsequent algorithms.

PCA assumes that the data is normally distributed and linearly related. It can also be sensitive to outliers, and it is important to scale the data before applying PCA to ensure that each variable is given equal weight in the analysis.

The principal components are the directions of maximum variation in the data. Here is the mathematical formulation of PCA:

Assuming that we have n observations, each with p features, we can represent the dataset as an n x p matrix X. We first center the data by subtracting the mean of each feature from each observation:
Z = X - (1/n) 1n X
where 1n is an n x n matrix of ones, so that (1/n) 1n X is the matrix whose rows all equal the vector of feature (column) means.

The covariance matrix of the centered data is given by:
C = 1/(n-1) Z'Z
We then calculate the eigenvectors and eigenvalues of C. The eigenvectors represent the principal components of the data, while the eigenvalues represent the amount of variance explained by each principal component. The first principal component, denoted by PC1, is the eigenvector corresponding to the largest eigenvalue. The second principal component, denoted by PC2, is the eigenvector corresponding to the second largest eigenvalue, and so on.

The principal components can be used to transform the original data into a new coordinate system. The transformed data, denoted by Y, is given by:
Y = Z W
where W is the matrix of principal components (the eigenvectors of C), with each column representing a principal component.

The amount of variance explained by each principal component can be calculated as the ratio of its eigenvalue to the sum of all eigenvalues. This can be used to determine how many principal components to retain for a given application.

PCA can also be used for dimensionality reduction, by retaining only the top k principal components, where k is a smaller number than the original number of features p. The transformed data Y can then be used as a lower-dimensional representation of the original data X.
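The PCA steps above can be carried out in a few lines of NumPy. This is a minimal sketch; the synthetic data and the choice of k = 2 retained components are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: 200 observations of 5 correlated features (assumed)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center the data by subtracting the column (feature) means
Z = X - X.mean(axis=0)

# Covariance matrix of the centered data
C = (Z.T @ Z) / (len(X) - 1)

# Eigen-decomposition (eigh is used because C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]            # sort from largest to smallest eigenvalue
eigvals, W = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()    # variance explained by each component
k = 2
Y = Z @ W[:, :k]                             # project onto the top-k principal components
print(explained_ratio)
print(Y.shape)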
INTERACTIVE LEARNING: Active Learning
Active learning is a type of interactive learning where the algorithm is allowed to choose which data samples to learn from in order to improve its performance. Unlike traditional supervised learning, where the algorithm is trained on a fixed set of labeled data, in active learning the algorithm is allowed to select the most informative samples from an unlabeled dataset and request labels for those samples. The labeled samples are then used to update the model, and the process is repeated iteratively.

The key idea behind active learning is to select the most informative samples to label, rather than labeling random samples. This is done by selecting samples that the model is uncertain about or that are representative of the underlying distribution. By doing so, active learning can achieve high performance with fewer labeled samples compared to traditional supervised learning.

There are different active learning strategies that can be used to select the most informative samples. Some common strategies include uncertainty sampling, query by committee, and density-based sampling.

Uncertainty sampling: This strategy selects samples that the model is uncertain about. For example, in a binary classification problem, the algorithm may select samples that are close to the decision boundary, as these samples are likely to be the most difficult to classify.

Query by committee: This strategy involves training multiple models on the same dataset and selecting samples that the models disagree on. This is based on the assumption that the samples that are difficult for one model to classify are also difficult for other models.
Density-based sampling: This strategy selects samples from regions of the feature space that are underrepresented in the labeled data. This can be done using clustering techniques or other density-based methods.

Overall, active learning can be a powerful tool for machine learning when labeled data is limited or expensive to obtain. By selecting the most informative samples to label, active learning can improve model performance with fewer labeled samples, making it an efficient and effective way to learn from data.
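A minimal pool-based uncertainty-sampling loop might look like the sketch below; the synthetic pool, the logistic-regression learner, the seed-set construction, and the number of query rounds are all assumptions for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Assumed setup: a large unlabeled pool with a hidden labeling "oracle"
X_pool = rng.normal(size=(1000, 5))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

# Small labeled seed set containing both classes
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

model = LogisticRegression()
for _ in range(20):                                   # 20 labeling rounds
    model.fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    # Uncertainty sampling: query the pool point closest to the decision boundary
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)                             # the "oracle" provides its label
    unlabeled.remove(query)

print(model.score(X_pool, y_pool))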
Interactive Learning: common heuristics
Interactive learning involves a range of techniques and heuristics that can be used to optimize the learning process. Here are some common heuristics used in interactive learning:

Active exploration: This heuristic involves exploring the environment to gather data and build a model. It is commonly used in reinforcement learning, where an agent interacts with the environment to learn the best actions to take in a given situation.

Uncertainty sampling: As mentioned earlier, uncertainty sampling is a common heuristic used in active learning. It involves selecting the samples that the model is most uncertain about, in order to improve performance.

Query-by-committee: This heuristic involves using multiple models to evaluate the uncertainty of different samples. Samples that are uncertain across multiple models are considered more informative and are given priority for labeling.

Diversity sampling: This heuristic involves selecting samples that are representative of different parts of the data distribution. This can help to avoid bias in the training data and improve model performance on new, unseen data.

Random sampling: This heuristic involves selecting samples at random from the dataset. While not as effective as other heuristics, random sampling can be useful for building a baseline model or for exploring the dataset.
Active learning with expert feedback: In this heuristic, an expert provides feedback to guide the selection of informative samples. This can be useful in situations where the model is being used to solve a specific task and where the expert has domain knowledge that can be leveraged to improve performance.

These are just a few of the common heuristics used in interactive learning. The choice of heuristic depends on the specific problem and the data available, and often involves a trade-off between exploration and exploitation, diversity and uncertainty, and other factors.

Interactive Learning: Sampling bias
Sampling bias is a common issue in machine learning, and it can also affect interactive learning. Sampling bias occurs when the examples that are used to train or test the model are not representative of the population that the model will be applied to. In interactive learning, sampling bias can occur when the examples that are presented to the user for labeling or feedback are not representative of the population that the model will be applied to.

Sampling bias can lead to inaccurate or incomplete models, as the model is trained on a limited and biased set of data. For example, if the examples presented to the user for labeling are biased towards a particular demographic group, the model may not perform well on data from other demographic groups. Similarly, if the examples presented to the user for labeling are biased towards a particular set of features or characteristics, the model may not perform well on data that contains different features or characteristics.

To avoid sampling bias in interactive learning, it is important to ensure that the examples presented to the user for labeling or feedback are representative of the population that the model will be applied to. This can be achieved by carefully selecting the initial set of examples, or by using sampling strategies that ensure diversity and balance in the examples presented to the user. It is also important to monitor the performance of the model over time, and to re-evaluate the sampling strategy as new data becomes available.
By addressing sampling bias in interactive learning, we can help to ensure that the resulting model is accurate, reliable, and robust in its performance.

Safe Disagreement Based Active Learning Schemes:
Safe Disagreement Based Active Learning (SDAL) is a type of interactive learning that aims to select informative examples for the user to label or provide feedback on, while also ensuring that the model's predictions are safe and reliable. SDAL is particularly useful in scenarios where incorrect predictions could have serious consequences, such as in healthcare or finance.

The basic idea behind SDAL is to select examples on which the model's predictions are uncertain, but on which there is high agreement between different versions of the model. The idea is that by selecting these examples, the user can provide feedback that will help to improve the model's accuracy and reduce its uncertainty, while also ensuring that the model's predictions are safe and reliable.

There are several different schemes that can be used to implement SDAL. One common approach is to use an ensemble of models, each of which is trained on a different subset of the data. By selecting examples on which there is high disagreement between the different models, we can identify examples that are particularly challenging for the model and that are likely to benefit from user feedback. Another approach is to use a Bayesian model, which allows for uncertainty in the model's predictions. By selecting examples on which the model's uncertainty is high, we can identify examples that are particularly informative for improving the model's accuracy.

Overall, SDAL is a powerful approach to interactive learning that can help to improve the accuracy and reliability of machine learning models in high-stakes applications. By selecting examples that are both informative and safe, we can ensure that the resulting model is well-suited to its intended use and can be relied upon to make accurate and trustworthy predictions.
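The ensemble-based scheme can be sketched as a query-by-disagreement loop: train several models on different resamples of the labeled data and query the pool point they dispute most. This is only an illustrative approximation of the idea described above, not a specific SDAL algorithm; the data, committee size, and tree learners are assumed:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # hidden labels for the demo oracle
labeled = list(range(20))                        # small initial labeled set
pool = list(range(20, len(X)))                   # unlabeled pool

def train_committee(idx, n_members=5):
    # Each committee member is trained on a different bootstrap sample of the labeled data
    members = []
    for _ in range(n_members):
        boot = rng.choice(idx, size=len(idx), replace=True)
        members.append(DecisionTreeClassifier(max_depth=3).fit(X[boot], y[boot]))
    return members

for _ in range(10):                              # 10 query rounds
    committee = train_committee(labeled)
    votes = np.array([m.predict(X[pool]) for m in committee])
    # Disagreement = fraction of members voting for the minority label on each pool point
    minority = np.minimum(votes.mean(axis=0), 1 - votes.mean(axis=0))
    query = pool[int(np.argmax(minority))]       # most disputed pool point
    labeled.append(query)                        # request its label from the user/oracle
    pool.remove(query)

print(len(labeled), "examples labeled after the disagreement-based queries")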
SEMI-SUPERVISED LEARNING
Semi-supervised learning is a type of machine learning that falls between the two main paradigms of supervised learning and unsupervised learning. In supervised learning, the model is trained using labeled examples, where the correct outputs are known. In unsupervised learning, the model is trained using unlabeled examples, where the correct outputs are not known. Semi-supervised learning, on the other hand, combines both labeled and unlabeled data to train a model.

The basic idea behind semi-supervised learning is that there is often much more unlabeled data available than labeled data. By using both types of data, we can take advantage of the large amounts of unlabeled data to improve the accuracy of the model, while still benefiting from the labeled data to guide the learning process.

Semi-supervised learning algorithms typically work by first training an unsupervised model on the unlabeled data. This unsupervised model is then used to generate labels for the unlabeled data, which are then combined with the labeled data to train a supervised model. The supervised model is then fine-tuned using both the labeled and unlabeled data, with the goal of improving the model's accuracy on the labeled data.

One of the main advantages of semi-supervised learning is that it can help to reduce the cost and effort of labeling large amounts of data, which can be time-consuming and expensive. It can also help to improve the accuracy of the resulting model, as the model is trained on a larger and more diverse set of data. However, semi-supervised learning can also be challenging, as it requires careful selection of the unlabeled data and the unsupervised model used to generate labels. In addition, semi-supervised learning algorithms can be more complex and computationally intensive than supervised learning algorithms, as they require additional steps to process and integrate the unlabeled data.

Overall, semi-supervised learning is a powerful approach to machine learning that can help to improve the accuracy and efficiency of models, particularly in scenarios where labeled data is scarce or expensive to obtain.
Semi-Supervised Learning: Transductive SVM
Transductive Support Vector Machines (TSVMs) are a type of semi-supervised learning algorithm that combines ideas from supervised and unsupervised learning to make predictions on a set of unlabeled data. TSVMs are particularly useful when we have a small amount of labeled data and a much larger amount of unlabeled data.

The basic idea behind TSVMs is to use the labeled data to train a Support Vector Machine (SVM) model, which is a supervised learning algorithm used for classification or regression. The SVM model learns a decision boundary that separates the different classes in the labeled data. This decision boundary is then used to predict the labels of the unlabeled data, which is done in a transductive manner, meaning that the predictions are only made on the specific set of unlabeled examples that are given.

TSVMs work by considering all possible labelings of the unlabeled data, and then selecting the labeling that results in the largest margin between the decision boundary and the data points of each class. This is done using an iterative optimization procedure, which alternates between optimizing the SVM model on the labeled data, and updating the labeling of the unlabeled data based on the current decision boundary.

One of the main advantages of TSVMs is that they can be very effective in scenarios where there is a large amount of unlabeled data and only a small amount of labeled data. By using the labeled data to guide the learning process and the unlabeled data to improve the accuracy of the model, TSVMs can achieve better performance than fully supervised learning algorithms that rely only on the labeled data.

However, TSVMs can also be computationally expensive, as they require the optimization of both the SVM model and the labeling of the unlabeled data. In addition, TSVMs assume that the labeled data is representative of the distribution of the unlabeled data, which may not always be the case in practice.
Overall, TSVMs are a powerful and flexible approach to semi-supervised learning, which can be used in a variety of applications where labeled data is scarce or expensive to obtain.
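The alternating procedure can be approximated with a short self-training-style loop: fit an SVM on the labeled data, label the unlabeled points with the current boundary, and refit with a smaller weight on those pseudo-labels. This is a simplified sketch of the idea, not a full TSVM solver (a real TSVM also enforces a class-balance constraint and gradually increases the weight on the unlabeled examples); the data and parameters are assumed:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Assumed setup: a few labeled points and many unlabeled points from two blobs
X_lab = np.vstack([rng.normal((0, 0), 1.0, (5, 2)), rng.normal((4, 4), 1.0, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([rng.normal((0, 0), 1.0, (200, 2)), rng.normal((4, 4), 1.0, (200, 2))])

# Step 1: train an SVM on the labeled data only
svm = SVC(kernel="linear", C=1.0).fit(X_lab, y_lab)

# Step 2: alternate between labeling the unlabeled data with the current
# decision boundary and retraining on labeled + pseudo-labeled data
for _ in range(5):
    pseudo = svm.predict(X_unl)
    X_all = np.vstack([X_lab, X_unl])
    y_all = np.concatenate([y_lab, pseudo])
    # Pseudo-labeled points get a smaller weight than the truly labeled points
    weights = np.concatenate([np.ones(len(y_lab)), 0.1 * np.ones(len(pseudo))])
    svm = SVC(kernel="linear", C=1.0).fit(X_all, y_all, sample_weight=weights)

print(svm.score(X_lab, y_lab))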
CO-TRAINING:
Co-training is another popular approach to semi-supervised learning that can be used to improve the accuracy of a machine learning model by leveraging both labeled and unlabeled data. Like TSVMs, co-training is particularly useful when we have a small amount of labeled data and a much larger amount of unlabeled data.

The basic idea behind co-training is to train multiple models, each using a different set of features, or views, of the data. These views may be obtained, for example, by applying different feature extraction techniques or by using different subsets of the available features. Each model is trained on a different subset of the labeled data, and the unlabeled data is then used to iteratively improve the accuracy of both models.

During each iteration, the two models make predictions on the unlabeled data, and the predictions that are most confident and consistent between the two models are used to augment the labeled data for the next iteration. This process of selecting the most informative examples and adding them to the labeled data is known as "pool-based active learning." The two models are then retrained on the expanded labeled data, and the process is repeated until the accuracy of both models converges.

Co-training can be a very effective approach to semi-supervised learning, as it allows us to leverage the complementary information in multiple views of the data to improve the accuracy of the model. By selecting the most informative examples from the unlabeled data and adding them to the labeled data, co-training can also help to reduce the amount of labeled data required to achieve high accuracy.

One of the main limitations of co-training is that it requires the data to be easily separable into two or more views, which may not always be the case in practice. In addition, co-training may not be as effective in scenarios where the unlabeled data is very noisy or contains irrelevant information.
Overall, co-training is a powerful and flexible approach to semi-supervised learning that can be used in a variety of applications where labeled data is scarce or expensive to obtain. When combined with other semi-supervised learning techniques, such as TSVMs, co-training can help to further improve the accuracy and efficiency of machine learning models.
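A minimal two-view co-training loop is sketched below; the two synthetic views, the naive Bayes learners, the 0.95 confidence threshold, and the cap of 20 additions per round are all assumptions made for the example:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Assumed setup: two "views" (feature subsets) of the same 400 examples
n = 400
y_true = rng.integers(0, 2, size=n)                    # hidden labels for the demo
view1 = rng.normal(loc=y_true[:, None] * 2.0, size=(n, 3))
view2 = rng.normal(loc=y_true[:, None] * 2.0, size=(n, 3))

labeled = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])
unlabeled = [i for i in range(n) if i not in labeled]
y = y_true.copy()        # only y[labeled] is treated as known; other entries get pseudo-labels

clf1, clf2 = GaussianNB(), GaussianNB()
for _ in range(10):
    clf1.fit(view1[labeled], y[labeled])
    clf2.fit(view2[labeled], y[labeled])
    p1 = clf1.predict_proba(view1[unlabeled]).max(axis=1)
    p2 = clf2.predict_proba(view2[unlabeled]).max(axis=1)
    agree = clf1.predict(view1[unlabeled]) == clf2.predict(view2[unlabeled])
    # Move the most confident, mutually consistent predictions into the labeled set
    confident = np.where(agree & (p1 > 0.95) & (p2 > 0.95))[0]
    for j in sorted(confident, reverse=True)[:20]:     # add at most 20 per round
        idx = unlabeled[j]
        y[idx] = clf1.predict(view1[[idx]])[0]         # pseudo-label from view 1
        labeled.append(idx)
        unlabeled.pop(j)

print((clf1.predict(view1) == y_true).mean())          # accuracy of one view's model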
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a type of machine learning that deals with sequential decision-making problems. In RL, an agent interacts with an environment, takes actions based on the current state, and receives feedback in the form of rewards. The goal of the agent is to learn a policy, which is a mapping from states to actions that maximizes the expected cumulative reward over time.

The RL framework is similar to how humans learn through trial and error. For example, when a child learns to walk, they take small steps and adjust their movements based on the feedback received from their environment. Similarly, in RL, an agent takes actions and adjusts its policy based on the rewards received from the environment.

RL is used in a wide range of applications, such as robotics, gaming, and finance, to model decision-making problems in which the consequences of actions are uncertain and the goal is to maximize long-term rewards. RL has been successfully applied in many real-world scenarios, such as training autonomous vehicles, optimizing energy consumption in buildings, and developing personalized medical treatments.

The RL framework consists of four main components: the environment, the agent, the state, and the reward. The environment represents the problem being solved, and it provides feedback to the agent in the form of rewards. The agent takes actions based on the current state and the policy, and it receives feedback from the environment in the form of rewards. The state represents the current situation of the environment, and it is used by the agent to make decisions. The reward is a scalar signal that indicates the quality of the action taken by the agent.
RL algorithms can be broadly classified into model-based and model-free methods. Model-based methods use a model of the environment to predict the outcomes of actions, while model-free methods directly estimate the value of actions without modeling the environment. Some popular RL algorithms include Q-learning, Deep Q-Networks (DQNs), and Policy Gradients.

Overall, RL is a powerful and flexible approach to solving sequential decision-making problems. By learning from experience and optimizing long-term rewards, RL can help to find optimal solutions in complex and uncertain environments.

Markov Decision Processes (MDPs) are a formal framework used in Reinforcement Learning to model sequential decision-making problems. An MDP is defined as a tuple (S, A, P, R, γ), where:
S is a set of states.
A is a set of actions.
P is the transition function, which specifies the probability of transitioning from one state to another when taking an action.
R is the reward function, which specifies the immediate reward received after taking an action in a state.
γ is the discount factor, which determines the importance of future rewards.

MDPs are characterized by the Markov property, which states that the future state and reward depend only on the current state and action, and not on the history of previous states and actions. This property allows us to represent a complex decision-making problem as a simple mathematical framework.

In an MDP, an agent takes actions based on the current state, receives a reward, and transitions to a new state according to the transition function. The goal of the agent is to learn a policy, which is a mapping from states to actions that maximizes the expected cumulative reward over time. This is known as the "optimal policy."

The optimal policy can be found using various methods, such as Dynamic Programming, Monte Carlo methods, and Temporal Difference learning.
These methods allow the agent to learn the values of different states and actions, which can be used to guide the selection of actions that maximize the expected cumulative reward.

MDPs are used in a wide range of applications, such as robotics, gaming, and finance, to model decision-making problems in which the consequences of actions are uncertain and the goal is to maximize long-term rewards. By using the mathematical framework of MDPs, we can formalize these problems and develop efficient algorithms to find optimal solutions.

VALUE ITERATION:
Value Iteration is a Dynamic Programming algorithm used in Reinforcement Learning to find the optimal value function and the optimal policy for a Markov Decision Process (MDP). It is an iterative algorithm that updates the value of each state using the Bellman equation. The value of a state in an MDP is defined as the expected cumulative reward starting from that state and following the optimal policy.

The Bellman equation expresses the value of a state in terms of the values of its neighboring states:

V(s) = max_a ∑_s' P(s'|s,a) [R(s,a,s') + γ V(s')]

where V(s) is the value of state s, a is the action taken in state s, s' is the next state, P(s'|s,a) is the probability of transitioning to s' when taking action a in state s, R(s,a,s') is the reward received when transitioning from state s to s' with action a, and γ is the discount factor.

The Value Iteration algorithm starts with an initial estimate of the value function and iteratively updates the values of all states until convergence. At each iteration, the algorithm updates the value of each state using the Bellman equation:

V_k+1(s) = max_a ∑_s' P(s'|s,a) [R(s,a,s') + γ V_k(s')]

where V_k(s) is the value of state s at the k-th iteration.
The algorithm continues to iterate until the values of all states converge to their optimal values. Once the optimal value function is obtained, the optimal policy can be derived by choosing the action that maximizes the value of each state:

π*(s) = argmax_a ∑_s' P(s'|s,a) [R(s,a,s') + γ V*(s')]

where π*(s) is the optimal policy in state s and V*(s) is the optimal value of state s.

Value Iteration is a powerful algorithm for solving MDPs, but it can be computationally expensive for large state spaces. In such cases, other methods like Policy Iteration or Monte Carlo methods may be more suitable.
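The update above can be implemented in a few lines. The sketch below runs Value Iteration on a tiny assumed MDP (3 states, 2 actions, state 2 treated as a rewarding goal); all transition probabilities, rewards, and the discount factor are made up for illustration:

import numpy as np

# Tiny assumed MDP with 3 states and 2 actions.
# P[a, s, s2] = probability of moving from state s to s2 under action a.
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],   # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
# R[a, s, s2] = immediate reward for the transition (entering state 2 pays 1)
R = np.zeros_like(P)
R[:, :, 2] = 1.0
gamma = 0.9

V = np.zeros(3)                                   # initial value estimate V_0
for _ in range(1000):
    # Bellman backup: Q[a, s] = sum_s2 P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)                         # V_{k+1}(s) = max_a Q[a, s]
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop once the values have converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=0)                         # greedy policy w.r.t. the converged values
print(V, policy)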
Q-LEARNING
Q-Learning is a model-free Reinforcement Learning algorithm used to learn the optimal action-value function for a Markov Decision Process (MDP). The action-value function, denoted as Q(s,a), represents the expected cumulative reward for taking action a in state s and following the optimal policy thereafter.

Q-Learning uses an iterative approach to update the Q-values for each state-action pair based on the observed rewards. The algorithm uses an exploration-exploitation trade-off strategy to balance between exploiting the known information about the environment and exploring new actions to gain more information.

The algorithm maintains a table of Q-values for each state-action pair, and it starts with an initial estimate of the Q-values. At each time step, the agent selects an action based on the current state and the Q-values, using an exploration strategy like epsilon-greedy or softmax. After taking the action, the agent observes the next state and the reward and updates the Q-value for the current state-action pair using the following equation:

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') - Q(s,a)]
where s is the current state, a is the action taken, r is the
reward received, s' is the next state, a' is the next action, γ is
the discount factor that determines the importance of future
rewards, and α is the learning rate that controls the weight of
the new information.
The Q-Learning algorithm iteratively updates the Q-values
based on the observed rewards, and the Q-values converge
to the optimal values for each state-action pair.
Once the optimal Q-values are obtained, the optimal policy
can be derived by choosing the action with the highest Q-
value in each state.
Q-Learning is a simple and powerful algorithm that can be
applied to a wide range of Reinforcement Learning
problems.
It has been successfully used in many real-world
applications, such as training autonomous agents, optimizing
control systems, and playing games like chess and Go.
However, Q-Learning may suffer from slow convergence
and suboptimal results when the state space is large or the
reward function is sparse.
In such cases, other algorithms like Deep Q-Networks
(DQNs) or Monte Carlo methods may be more suitable.
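For comparison with Value Iteration, the sketch below learns a Q-table for the same kind of tiny assumed MDP by interacting with it; the environment dynamics, learning rate, epsilon, and episode counts are all assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
gamma, alpha, epsilon = 0.9, 0.1, 0.1

# Assumed environment: a small 3-state MDP with state 2 as the rewarding goal
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],   # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],   # action 1
])

def step(state, action):
    next_state = rng.choice(n_states, p=P[action, state])
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward

Q = np.zeros((n_states, n_actions))              # initial Q-value table
for episode in range(2000):
    s = 0
    for t in range(50):                          # each episode lasts at most 50 steps
        # Epsilon-greedy selection: explore with probability epsilon, otherwise exploit
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r = step(s, a)
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
        if s == 2:                               # goal reached; end the episode
            break

print(Q.argmax(axis=1))                          # greedy policy derived from the learned Q-table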