UNIT- IV
Ensemble learning is one of the most powerful machine learning techniques. It combines the
output of two or more models (weak learners) to solve a particular computational
intelligence problem. For example, a Random Forest is an ensemble of many decision trees.
An ensemble model is a machine learning model that combines the predictions from two or
more models.
Hard Voting: In hard voting, the predicted output class is the class that receives the
majority of votes, i.e., the class predicted by the largest number of classifiers. Suppose
three classifiers predict the output classes (A, A, B). The majority predicted A, so A is
the final prediction.
Soft Voting: In soft voting, the output class is the one with the highest average predicted
probability across the models. Suppose that, for some input, three models assign class A the
probabilities (0.30, 0.47, 0.53) and class B the probabilities (0.20, 0.32, 0.40). So the average for
UNIT-4 IV- Page 2 of 24
KCE-CSE –AI&ML 2023
class A is 0.4333 and for class B is 0.3067; the winner is class A because it has the highest
probability averaged over the classifiers.
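The two voting rules can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `hard_vote` and `soft_vote` are assumptions made for this example, and the inputs are the numbers from the two examples above.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(class_probabilities):
    """Average each class's probabilities over the classifiers and pick the largest."""
    averages = {label: sum(p) / len(p) for label, p in class_probabilities.items()}
    winner = max(averages, key=averages.get)
    return winner, averages

# Inputs taken from the hard-voting and soft-voting examples in the notes.
hard_winner = hard_vote(["A", "A", "B"])
soft_winner, soft_averages = soft_vote({"A": [0.30, 0.47, 0.53],
                                        "B": [0.20, 0.32, 0.40]})
print(hard_winner)
print(soft_winner, {label: round(avg, 4) for label, avg in soft_averages.items()})
```

Both rules pick class A here; they can disagree when a minority of classifiers is very confident.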
Ensemble learning is primarily used to improve model performance on tasks such as
classification, prediction, and function approximation. In simple words, we can summarise
ensemble learning as follows:
There are many ways to ensemble models in machine learning, such as bagging,
boosting, and stacking.
Stacking is one of the most popular ensemble machine learning techniques. It combines
the predictions of multiple models to build a new model and improve model performance.
Stacking enables us to train multiple models to solve similar problems and, based on
their combined output, build a new model with improved performance.
1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised
machine learning problems. It is generally completed in two steps as follows:
Bootstrapping:
It is a random sampling method that draws samples from the data with replacement.
In this method, random data samples are first fed to the primary model, and then a
base learning algorithm is run on the samples to complete the learning process.
Aggregation:
This step combines the outputs of all base models and, based on them, predicts an
aggregate result with greater accuracy and reduced variance.
Example: In the Random Forest method, predictions from multiple decision trees are
combined in parallel.
In regression problems, we use the average of these predictions as the final
output, whereas in classification problems, the mode (the most frequently predicted
class) is selected as the predicted class.
The feature offering the best split among the candidates is used to split the node.
The tree is grown so that it has the best root nodes.
The above steps are repeated n times, and the outputs of the individual decision
trees are aggregated to give the final prediction.
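The two bagging steps (bootstrapping, then aggregation) can be sketched as follows. This is a minimal sketch under stated assumptions: the base learner is a one-feature threshold rule (`fit_stump`, a hypothetical helper) rather than a full decision tree, the data set is made up, and aggregation uses a majority vote.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Fit a 1-D threshold classifier with the fewest training errors."""
    best = None
    for threshold, _ in samples:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in samples)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging_fit(data, n_models, seed=0):
    """Bootstrapping: train each base model on a sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = rng.choices(data, k=len(data))
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Aggregation: majority vote over the base models' predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-feature data set: (feature value, class label).
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 1.2), bagging_predict(models, 7.2))
```

For regression, the vote in `bagging_predict` would be replaced by the average of the base predictions.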
2. Boosting
Boosting is an ensemble method in which each member learns from the preceding
member's mistakes in order to make better predictions.
Unlike bagging, boosting arranges the base (weak) learners sequentially so that
each one can learn from the mistakes of its predecessor.
In this way, the weak learners are combined into a strong learner, giving a
predictive model with significantly improved performance.
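One common way to make each learner "learn from the mistakes" of the previous one is to reweight the training points, as in AdaBoost. The notes do not name a specific boosting algorithm, so the sketch below is an assumption: it uses AdaBoost-style reweighting with threshold rules as weak learners, labels in {-1, +1}, and a made-up data set that no single threshold can classify correctly.

```python
import math

def fit_weighted_stump(xs, ys, weights):
    """Pick the threshold and orientation with the lowest weighted error."""
    best = None
    for threshold in xs:
        for sign in (1, -1):
            preds = [sign if x <= threshold else -sign for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def adaboost_fit(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, sign = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this learner's say in the vote
        ensemble.append((alpha, threshold, sign))
        # Raise the weight of misclassified points and lower the rest,
        # so the next learner focuses on the previous learner's mistakes.
        weights = [w * math.exp(-alpha * y * (sign if x <= threshold else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * (sign if x <= threshold else -sign)
                for alpha, threshold, sign in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data: no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost_fit(xs, ys, rounds=3)
train_accuracy = sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print(train_accuracy)
```

Each individual threshold rule misclassifies at least three points of this data set, while the weighted vote of three rules classifies all ten correctly.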
3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are trained in parallel, and their predictions are combined by a
meta-learner so that better predictions can be made.
This ensemble technique feeds the combined predictions of multiple weak learners into
a meta-learner so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to
learn how to best combine the input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the
Model Averaging Ensemble technique, in which all sub-models participate according to
their performance weights and build a new model with better predictions. This
new model is stacked on top of the others, which is why the technique is named
stacking.
Original data: This data is divided into n folds and serves as either training data or
test data.
Base models: These models are also referred to as level-0 models. They use the
training data and provide compiled predictions (level-0) as an output.
Level-0 Predictions: Each base model is trained on some training data and produces
its own predictions, which are known as level-0 predictions.
Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models. It is trained on predictions made by the individual base models on
data they were not trained on: this held-out data is fed to the base models,
predictions are made, and these predictions, along with the expected outputs, provide
the input and output pairs of the training dataset used to fit the meta-model.
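The level-0 / level-1 architecture described above can be sketched as follows. This is a simplified sketch under stated assumptions: it uses a single hold-out split instead of n folds, the base models are simple threshold rules, the meta-model is a lookup table of majority labels, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def fit_threshold_rule(rows, feature):
    """Level-0 learner: predict 1 when the chosen feature exceeds its training mean."""
    cutoff = sum(x[feature] for x, _ in rows) / len(rows)
    return lambda x: 1 if x[feature] > cutoff else 0

def fit_meta_table(level0_predictions, labels):
    """Level-1 learner: map each tuple of base predictions to its majority label."""
    buckets = defaultdict(Counter)
    for preds, y in zip(level0_predictions, labels):
        buckets[preds][y] += 1
    table = {preds: counts.most_common(1)[0][0] for preds, counts in buckets.items()}
    fallback = Counter(labels).most_common(1)[0][0]
    return lambda preds: table.get(preds, fallback)

# Hypothetical data: the label is 1 only when both features are large.
train = [((1, 1), 0), ((1, 9), 0), ((9, 1), 0), ((9, 9), 1),
         ((2, 2), 0), ((2, 8), 0), ((8, 2), 0), ((8, 8), 1)]
holdout = [((1, 2), 0), ((2, 9), 0), ((9, 2), 0), ((8, 9), 1),
           ((3, 1), 0), ((1, 8), 0), ((8, 3), 0), ((9, 8), 1)]

# Level-0: fit the base models on the training split only.
base_models = [fit_threshold_rule(train, feature=0),
               fit_threshold_rule(train, feature=1)]

# Level-0 predictions on data the base models have not seen.
level0 = [tuple(model(x) for model in base_models) for x, _ in holdout]

# Level-1: fit the meta-model on (level-0 predictions, expected output) pairs.
meta_model = fit_meta_table(level0, [y for _, y in holdout])

def stacked_predict(x):
    return meta_model(tuple(model(x) for model in base_models))

print(stacked_predict((7, 7)), stacked_predict((7, 2)))
```

Neither base model alone can represent "both features large", but the meta-model learns that rule from their combined predictions.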
In K-means clustering, K defines the number of pre-defined clusters that need to be
created in the process: if K=2, there will be two clusters, for K=3 there will be three
clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different
clusters in such a way that each data point belongs to only one group of points with
similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover
the categories of groups in the unlabeled dataset on its own, without the need for any
training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of squared distances between the
data points and the centroids of their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into K
clusters, and repeats the process until the cluster assignments no longer
change. The value of K should be predetermined in this algorithm.
Determines the best positions for the K center points (centroids) by an iterative process.
Assigns each data point to its closest centroid. The data points nearest to a
particular centroid form a cluster.
Hence each cluster contains data points with some commonalities and is separated from
the other clusters.
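The quantity that K-means tries to minimize can be written compactly. The symbols are assumptions introduced here, not notation from the notes: $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is its centroid, and $J$ is the within-cluster sum of squared distances.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Assigning each point to its closest centroid and then moving each centroid to the mean of its points can never increase $J$, which is why the iteration settles down.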
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined
K clusters.
Step-4: Calculate the mean (center of gravity) of the points in each cluster and place
that cluster's new centroid there.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest
centroid.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
Let's take the number of clusters K=2, which means we will try to group this dataset
into two different clusters.
We need to choose K random points or centroids to form the clusters. These points
can be either points from the dataset or any other points.
Here we select the two points below as the K points; they are not part of
our dataset. Consider the below image:
Now we assign each data point of the scatter plot to its closest K-point or centroid,
using the usual formula for the distance between two points. To visualise this, we draw
a median line between the two centroids, i.e., the perpendicular bisector of the segment
joining them. Consider the below image:
From the above image, points to the left of the line are nearer to the K1 (blue)
centroid, and points to the right of the line are nearer to the yellow centroid. Let's color
them blue and yellow accordingly.
As we need to find the closest cluster, we repeat the process by choosing new
centroids. To choose them, we compute the center of gravity of the points in each
cluster and place the new centroids there, as below:
Next, we reassign each data point to the new centroids. For this, we repeat the
same process of finding a median line. The median will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line
and two blue points are to the right of it. So these three points will be assigned to new
centroids.
As reassignment has taken place, we go back to Step-4, which is finding
new centroids or K-points. We repeat the process by finding the center of gravity
of each cluster.
As we have new centroids, we again draw the median line and reassign the data
points. So, the image will be:
We can see in the above image that there are no dissimilar data points on either side
of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:
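The procedure walked through above (assign points to the closest centroid, move each centroid to the center of gravity of its points, repeat until nothing is reassigned) can be sketched in plain Python. The data points and starting centroids below are hypothetical; as in the walkthrough, K=2 and the starting centroids are not part of the dataset.

```python
import math

def kmeans(points, centroids, max_iterations=100):
    centroids = list(centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assign every point to the index of its closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point was reassigned, so the clusters are final
        assignments = new_assignments
        # Move each centroid to the mean (center of gravity) of its points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # K=2 starting points, not taken from the data
centroids, assignments = kmeans(points, initial)
print(centroids)
print(assignments)
```

With poorly chosen starting centroids the same code can settle on a worse clustering, which is why K-means is often run several times from different starting points.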
Applications
Customer Segmentation: K-means clustering in machine learning permits marketers
to enhance their customer base, work on a target base, and segment customers based
on purchase patterns, interests, or activity monitoring. Segmentation helps companies
target specific clusters/groups of customers for particular campaigns.
Document Classification: Cluster documents in multiple categories based on tags,
topics, and content. K-means clustering in machine learning is a suitable algorithm for
this purpose. The documents first need to be processed so that each document is
represented as a vector, using term frequency to identify commonly used terms
that help classify the document. The document vectors are then clustered to help
identify similarities in document groups.
Delivery store optimization: K-means clustering in machine learning helps to
optimize the process of goods delivery using truck drones. K-means clustering in
machine learning helps to find the optimal number of launch locations.
Insurance fraud detection: Machine learning is critical in fraud detection and has
numerous applications in automobile, healthcare, and insurance fraud detection.
K-Means clustering in machine learning can also be used for performing image
segmentation by trying to group similar pixels in the image together and creating
clusters.
One more way to categorize Machine Learning systems is by how they generalize.
Generalization usually refers to an ML model's ability to perform well on new, unseen data
rather than just the data that it was trained on.
Most Machine Learning tasks are about making predictions. This means that, given a number
of training examples, the system needs to be able to make good predictions for (generalize
to) examples it has never seen before. Having a good performance measure on the training
data is good, but insufficient; the true goal is to perform well on new instances.
There are two main approaches to generalization: instance-based learning and
model-based learning.
1. Instance-based learning:
Instance-based learning (sometimes called memory-based learning) is a family of learning
algorithms that, instead of performing explicit generalization, compare new problem
instances with instances seen in training, which have been stored in memory.
Instance-based learning systems, also known as lazy learning systems, store the entire
training dataset in memory. When a new instance is to be classified, the system compares
it with the stored instances and bases its answer on the most similar ones. These systems
do not build a model using the training dataset.
2. Model-based learning:
Model-based learning uses machine learning models that are parameterized with a fixed
number of parameters, which does not change as the size of the training data changes.
If you don't assume any distribution with a fixed number of parameters over your data,
as in k-nearest neighbor or in a decision tree, where the number of
parameters grows with the size of the training data, then the method is not model-based;
it is nonparametric.
Model-based learning systems are also known as eager learning systems, because the
model is learned from the training data up front. These systems build a machine learning
model from the entire training dataset by analyzing the training data and
identifying patterns and relationships. After that, the model can be used to make
predictions on new data.
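The difference between the two approaches can be illustrated on a tiny, made-up data set. In this sketch (function names and data are assumptions), the model-based learner summarizes the data in two parameters, a slope and an intercept, while the instance-based learner keeps the raw data and answers with the target of the nearest stored point.

```python
def fit_line(xs, ys):
    """Model-based: learn two parameters (slope, intercept) by least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def nearest_neighbor_predict(xs, ys, query):
    """Instance-based: return the target of the closest stored example."""
    index = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return ys[index]

# Hypothetical training data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)           # training happens here (eager)
model_prediction = slope * 10 + intercept     # prediction needs only 2 numbers

instance_prediction = nearest_neighbor_predict(xs, ys, 10)  # needs the stored data (lazy)
print(model_prediction, instance_prediction)
```

For the query x=10, far from the training data, the fitted line extrapolates the trend while the nearest-neighbor learner can only repeat the value of the closest stored example.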
Which is better?
In general, it’s better to choose model-based learning when the goal is to make
predictions on unseen data, and when there are enough computational resources
available. And it’s better to choose instance-based learning when the goal is to make
predictions on new data that are similar to the training instances, and when there are
limited computational resources available.
So, both instance-based and model-based learning have their own advantages and
disadvantages, and the choice between them depends on the specific problem and the
available resources.
KNN is a lazy learner: it does not learn from the training set immediately; instead it
stores the dataset and, at the time of classification, performs an action on the dataset.
At the training phase, the KNN algorithm just stores the dataset; when it gets new data,
it classifies that data into the category it is most similar to.
Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, and we want to know whether it is a cat or a dog. For this identification we can use
the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the
features of the new image with those of the cat and dog images and, based on the most
similar features, put it in either the cat or the dog category.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
First, we choose the number of neighbors; here we take K=5.
Next, we calculate the Euclidean distance between the new data point and the existing
data points. The Euclidean distance is the straight-line distance between two points,
familiar from geometry. It can be calculated as:
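The formula itself did not survive in these notes; the standard Euclidean distance between two points in the plane is given here, with $(x_1, y_1)$ and $(x_2, y_2)$ as assumed names for the coordinates of the two points.

```latex
d\big((x_1, y_1), (x_2, y_2)\big) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
```

With more than two features, one squared difference is added under the square root for each additional feature.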
There is no particular way to determine the best value for K, so we need to try some
values and pick the best of them. A commonly used value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive
to outliers.
Larger values of K reduce the effect of noise, but a value that is too large can pull in
points from other classes.
The computation cost is high because the distance from the new point to every training
sample must be calculated.
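"Trying some values" of K can be done by holding out each point in turn and checking whether its neighbors predict its class (leave-one-out validation). The notes do not prescribe a validation scheme, so this is only one possible sketch, with a made-up data set of two groups of four points.

```python
from collections import Counter

def knn_predict(training, query, k):
    """Classify query by majority vote among its k nearest training points."""
    ranked = sorted(training,
                    key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Fraction of points classified correctly when each is held out in turn."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

# Hypothetical data: two well-separated groups of four points each.
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B")]

scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5, 7)}
print(scores)
```

On this data K=1, 3, and 5 classify every held-out point correctly, while K=7 fails on all of them: each held-out point has only three same-class neighbors left, so the four points of the other class outvote them.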
We have data from a questionnaire survey and objective testing with two attributes (acid
durability and strength) to classify whether a special paper tissue is good or not. Here are
the training samples:
Now the factory produces a new paper tissue that passes the lab test with x1=3, x2=7.
Without another expensive survey, find the classification of this new tissue.
1. Determine K, e.g., K=3.
2. Calculate the distance between the query instance and all the training samples. Here
the query instance is (3,7).
For the training sample x1=7, x2=4: squared distance to the query = $(7-3)^2 + (4-7)^2 = 25$;
rank = 4; included in the 3 nearest neighbors: No.
Use the simple majority of the categories of the nearest neighbors to predict the query
instance. Here Y=Good (2 votes).
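The whole procedure can be sketched in Python. The table of training samples is not reproduced in these notes, so the four samples below are assumed values, chosen to be consistent with the one surviving row (x1=7, x2=4, squared distance 25, rank 4) and with the stated result of two votes for Good.

```python
from collections import Counter

def knn_classify(training, query, k):
    """Rank training samples by squared distance and vote among the k nearest."""
    ranked = sorted(training,
                    key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

# Assumed training samples: ((acid durability x1, strength x2), category).
training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

label, votes = knn_classify(training, query=(3, 7), k=3)
print(label, dict(votes))
```

Ranking by squared distance gives the same order as ranking by distance, so the square root can be skipped.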
Eg 2
Suppose we have the height, weight, and T-shirt size of some customers, and we need to
predict the T-shirt size of a new customer given only their height and weight.
Data including height, weight, and T-shirt size information is shown. A new customer
named 'Monica' has height 161cm and weight 61kg. Find the T-shirt size.
Eg:3
The table above represents our data set. We have two columns — Brightness and Saturation.
Each row in the table has a class of either Red or Blue.
We have a new entry but it doesn't have a class yet. To know its class, we have to calculate the
distance from the new entry to other entries in the data set using the Euclidean distance
formula.
Where:
Distance #1
For the first row, d1:
We now know the distance from the new data entry to the first entry in the table. Let's update
the table.
Distance #2
For the second row, d2:
The table will look like this after all the distances have been calculated:
Since we chose 5 as the value of K, we'll only consider the first five rows. That is:
As you can see above, the majority class within the 5 nearest neighbors to the new entry is
Red. Therefore, we'll classify the new entry as Red.
Choosing a very low value of K will most likely lead to inaccurate predictions.
A commonly used value of K is 5.
Prefer an odd number as the value of K, so that a vote between two classes cannot end in a tie.
Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used
to group data into different categories based on the probability distribution.
Gaussian mixture models can be used in many different areas, including finance and
marketing.
Gaussian mixture models (GMM) are probabilistic models used to represent real-world data
sets. GMMs are a generalization of Gaussian distributions and can be used to represent
any data set that can be clustered into multiple Gaussian distributions.
The Gaussian mixture model is a probabilistic model that assumes all the data points
are generated from a mix of Gaussian distributions with unknown parameters.
A Gaussian mixture model can be used for clustering, which is the task of grouping a set
of data points into clusters.
GMMs can be used to find clusters in data sets where the clusters may not be clearly
defined. Additionally, GMMs can be used to estimate the probability that a new data point
belongs to each cluster.
Gaussian mixture models are also relatively robust to outliers, meaning that they can
still yield accurate results even if there are some data points that do not fit neatly into any
of the clusters. This makes GMMs a flexible and powerful tool for clustering data.
It can be understood as a probabilistic model where Gaussian distributions are assumed
for each group, with means and covariances as their parameters.
A GMM is described by mean vectors (μ) and covariance matrices (Σ), together with the
proportion of data points that comes from each Gaussian. A Gaussian distribution is a
continuous probability distribution with a bell-shaped curve.
Another name for Gaussian distribution is the normal distribution. Here is a picture of
Gaussian mixture models:
GMM has many applications, such as density estimation, clustering, and image
segmentation.
For density estimation, GMM can be used to estimate the probability density function of a
set of data points.
For clustering, GMM can be used to group together data points that come from the same
Gaussian distribution. And for image segmentation, GMM can be used to partition an
image into different regions.
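The density that a GMM estimates can be written as a weighted sum of Gaussians. The symbols $K$ (number of Gaussians) and $\pi_k$ (the proportion, or mixing weight, of the $k$-th Gaussian) are assumptions introduced here; $\mu_k$ and $\Sigma_k$ are the mean vectors and covariance matrices mentioned above.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```

For clustering, a point $x$ is associated with component $k$ with probability $\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) / p(x)$.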
Gaussian mixture models can be used for a variety of use cases, including identifying
customer segments, detecting fraudulent activity, and clustering images.
In each of these examples, the Gaussian mixture model is able to identify clusters in the
data that may not be immediately obvious.
As a result, Gaussian mixture models are a powerful tool for data analysis and should be
considered for any clustering task.
The following are three steps in using Gaussian mixture models:
Determining a covariance matrix for each Gaussian, which defines how its variables are
related to one another. The more similar two Gaussians are, the closer their means will
be, and vice versa if they are far away from each other in terms of similarity.
A Gaussian mixture model can use covariance matrices that are diagonal or full (every
covariance matrix is symmetric).
Determining the number of Gaussians, which defines how many clusters there are.
Selecting the hyperparameters that define how to optimally separate the data using
Gaussian mixture models, including whether each Gaussian's covariance
matrix is diagonal or full.
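Once the number of Gaussians is chosen, the means, variances, and proportions are usually estimated with the expectation-maximization (EM) algorithm. These notes do not spell out the fitting procedure, so the sketch below is an assumption: a one-dimensional EM loop in plain Python (so each covariance matrix is just a variance), with made-up data and starting values.

```python
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def responsibilities(x, means, variances, weights):
    """Probability that x came from each Gaussian (soft cluster membership)."""
    dens = [w * gaussian_pdf(x, m, v) for m, v, w in zip(means, variances, weights)]
    total = sum(dens)
    return [d / total for d in dens]

def fit_gmm_1d(data, means, variances, weights, iterations=50):
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: soft-assign every point to every Gaussian.
        resp = [responsibilities(x, means, variances, weights) for x in data]
        # M-step: re-estimate each Gaussian from its softly assigned points.
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6,  # floor keeps a variance from collapsing to zero
            )
    return means, variances, weights

# Hypothetical 1-D data drawn around two centers.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = fit_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
membership = responsibilities(1.3, means, variances, weights)
print(means, weights)
print(membership)
```

The `membership` list is the estimated probability that the new point 1.3 belongs to each cluster, which is the soft assignment that distinguishes a GMM from K-means.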
Application areas
There are many different real-world problems that can be solved with Gaussian mixture
models. Gaussian mixture models are very useful when there are large datasets and it is
difficult to find clusters. They can find Gaussian-shaped clusters more effectively
than other clustering algorithms such as k-means.
Here are some real-world problems which can be solved using Gaussian mixture models:
Finding patterns in medical datasets: GMMs can be used for segmenting images into
multiple categories based on their content or finding specific patterns in medical datasets.
They can be used to find clusters of patients with similar symptoms, identify disease
subtypes, and even predict outcomes. In one recent study, a Gaussian mixture model was
used to analyze a dataset of over 700,000 patient records. The model was able to identify
previously unknown patterns in the data, which could lead to better treatment for patients
with cancer.
Modeling natural phenomena: GMM can be used to model natural phenomena where it
has been found that noise follows Gaussian distributions. This kind of probabilistic
modeling relies on the assumption that there exists some underlying continuum of
unobserved entities or attributes and that each member is associated with measurements
taken at equidistant points in multiple observation sessions.
Customer behavior analysis: GMMs can be used for performing customer behavior
analysis in marketing to make predictions about future purchases based on historical data.
Stock price prediction: Another area Gaussian mixture models are used is in finance
where they can be applied to a stock’s price time series. GMMs can be used to detect
changepoints in time series data and help find turning points of stock prices or other
market movements that are otherwise difficult to spot due to volatility and noise.
Gene expression data analysis: Gaussian mixture models can be used for gene
expression data analysis. In particular, GMMs can be used to detect differentially
expressed genes between two conditions and identify which genes might contribute
toward a certain phenotype or disease state.
What are the differences between Gaussian mixture models and other types of
clustering algorithms such as K-means?
Here are some of the key differences between Gaussian mixture models and the K-means
algorithm used for clustering:
A Gaussian mixture model is a type of clustering algorithm that assumes that the data
points are generated from a mixture of Gaussian distributions with unknown parameters.
The goal of the algorithm is to estimate the parameters of the Gaussian distributions, as
well as the proportion of data points that come from each distribution. In contrast,
K-means is a clustering algorithm that does not make any assumptions about the underlying
distribution of the data points. Instead, it simply partitions the data points into K clusters,
where each cluster is defined by its centroid.
While Gaussian mixture models are more flexible, they can be more difficult to train than
K-means. K-means is typically faster to converge and so may be preferred in cases where
the runtime is an important consideration.
In general, K-means will be faster and more accurate when the data set is large and the
clusters are well-separated. Gaussian mixture models will be more accurate when the data
set is small or the clusters are not well-separated.
Gaussian mixture models take into account the variance of the data, whereas K-means
does not.
Gaussian mixture models are more flexible in terms of the shape of the clusters, whereas
K-means is limited to spherical clusters.
Gaussian mixture models can handle missing data, whereas K-means cannot. This
difference can make Gaussian mixture models more effective in certain applications, such
as data with a lot of noise or data that is not well-defined.