
KCE-CSE –AI&ML 2023

UNIT- IV

ENSEMBLE LEARNING AND UNSUPERVISED LEARNING

S.NO  TOPICS
19    Combining multiple learners: Model combination schemes, Voting
20    Ensemble Learning - bagging, boosting, stacking
21    Unsupervised learning: K-means
22    Instance Based Learning: KNN
23    Gaussian mixture models and Expectation maximization


TOPIC 19. COMBINING MULTIPLE LEARNERS: MODEL COMBINATION SCHEMES, VOTING

Ensemble learning is one of the most powerful machine learning techniques. It uses the combined output of two or more models (weak learners) to solve a particular computational intelligence problem. E.g., a Random Forest algorithm is an ensemble of several decision trees combined.

An ensemble model is a machine learning model that combines the predictions from two or more models.

 A Voting Classifier is a machine learning model that trains on an ensemble of numerous models and predicts an output (class) based on the class with the highest probability of being chosen as the output.
 It simply aggregates the findings of each classifier passed into the Voting Classifier and predicts the output class based on the majority of votes.
 The idea is that, instead of creating separate dedicated models and finding the accuracy of each of them, we create a single model which is trained on these models and predicts the output based on their combined majority of votes for each output class.

A Voting Classifier supports two types of voting:

 Hard Voting: In hard voting, the predicted output class is the class that receives the majority of votes, i.e., the class most often predicted by the individual classifiers. Suppose three classifiers predict the output classes (A, A, B); the majority predicted A, so A will be the final prediction.
 Soft Voting: In soft voting, the output class is the prediction based on the average of the probabilities given to each class. Suppose, for some input, the prediction probabilities of the three models for class A are (0.30, 0.47, 0.53) and for class B are (0.20, 0.32, 0.40). The average for class A is 0.4333 and for class B is 0.3067, so the winner is clearly class A because it has the highest probability averaged across the classifiers.
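Below is a minimal sketch of hard and soft voting, assuming scikit-learn's VotingClassifier; the toy dataset (make_classification) and the choice of the three base models are illustrative assumptions, not part of the notes above.

```python
# A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=4)),
    ("svc", SVC(probability=True)),   # probability=True is needed for soft voting
]

# Hard voting: each classifier casts one vote; the majority class wins.
hard = VotingClassifier(estimators=base_models, voting="hard").fit(X, y)

# Soft voting: predicted class probabilities are averaged and the class
# with the highest average probability wins.
soft = VotingClassifier(estimators=base_models, voting="soft").fit(X, y)

print("hard-voting prediction:", hard.predict(X[:1]))
print("soft-voting prediction:", soft.predict(X[:1]))
```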

TOPIC 20. ENSEMBLE LEARNING - BAGGING, BOOSTING, STACKING

Ensemble learning is primarily used to improve the model performance, such as classification,
prediction, function approximation, etc. In simple words, we can summarise the ensemble
learning as follows:

 There are many ways to ensemble models in machine learning, such as bagging, boosting, and stacking.
 Stacking is one of the most popular ensemble machine learning techniques; it combines the predictions of multiple models to build a new model and improve performance.
 Stacking enables us to train multiple models to solve similar problems and, based on their combined output, build a new model with improved performance.

1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised
machine learning problems. It is generally completed in two steps as follows:

Bootstrapping:
 It is a random sampling method that is used to derive samples from the data with replacement.
 In this method, first, random data samples are fed to the primary model, and then a base learning algorithm is run on the samples to complete the learning process.
Aggregation:
 This step involves combining the output of all base models and, based on their outputs, predicting an aggregate result with greater accuracy and reduced variance.
 Example: In the Random Forest method, predictions from multiple decision trees are ensembled in parallel.
 Further, in regression problems, we use the average of these predictions to get the final output, whereas, in classification problems, the majority-voted class is selected as the prediction.

Steps to Perform Bagging


 Consider there are n observations and m features in the training set. Select a random sample from the training dataset with replacement (a bootstrap sample)
 A subset of the m features is chosen randomly to create a model using the sample observations
 The feature offering the best split out of the lot is used to split the nodes
 The tree is grown, so you have the best root nodes
 The above steps are repeated n times. The outputs of the individual decision trees are aggregated to give the best prediction

Advantages of Bagging in Machine Learning


 Bagging minimizes the overfitting of data
 It improves the model’s accuracy
 It deals with higher dimensional data efficiently
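As a concrete illustration, here is a minimal bagging sketch, assuming scikit-learn's BaggingClassifier with decision trees as the base learners; the dataset and hyperparameters are illustrative choices, not part of the notes above.

```python
# A minimal bagging sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bootstrapping: each of the 50 trees is trained on a random sample drawn
# with replacement from the training set (bootstrap=True).
# Note: scikit-learn versions before 1.2 name this argument base_estimator.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,
    random_state=0,
)

# Aggregation: predictions of the individual trees are combined by majority vote.
scores = cross_val_score(bagging, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```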

2. Boosting
 Boosting is an ensemble method that enables each member to learn from the
preceding member's mistakes and make better predictions for the future.
 Unlike the bagging method, in boosting, all base learners (weak) are arranged in a
sequential format so that they can learn from the mistakes of their preceding learner.
 Hence, in this way, all weak learners get turned into strong learners and make a better
predictive model with significantly improved performance.
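A minimal boosting sketch, assuming scikit-learn's AdaBoostClassifier (one common boosting algorithm; the notes above do not name a specific one). The toy dataset and hyperparameters are illustrative assumptions.

```python
# A minimal boosting sketch: weak learners are added sequentially, and
# misclassified samples receive larger weights for the next learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# By default the weak learner is a shallow decision stump.
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boosting.fit(X, y)
print("training accuracy:", boosting.score(X, y))
```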

The process for building one bootstrap sample can be summarized as follows:

 Choose the size of the sample.
 While the size of the sample is less than the chosen size:
 Randomly select an observation from the dataset
 Add it to the sample
 The bootstrap method can be used to estimate a quantity of a population. This is done by repeatedly taking small samples, calculating the statistic, and taking the average of the calculated statistics.

We can summarize this procedure as follows:

 Choose a number of bootstrap samples to perform
 Choose a sample size
 For each bootstrap sample:
 Draw a sample with replacement of the chosen size
 Calculate the statistic on the sample
 Calculate the mean of the calculated sample statistics.
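A small NumPy sketch of this bootstrap procedure, assuming a made-up data sample and using the mean as the statistic of interest:

```python
# Bootstrap estimation: repeatedly draw samples with replacement,
# compute a statistic (here the mean), and average the results.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # the sample of data we have

n_bootstraps = 1000        # number of bootstrap samples to perform
sample_size = len(data)    # chosen sample size

stats = []
for _ in range(n_bootstraps):
    sample = rng.choice(data, size=sample_size, replace=True)  # draw with replacement
    stats.append(sample.mean())                                # statistic on the sample

print("bootstrap estimate of the mean:", np.mean(stats))
```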


3. Stacking
 Stacking is one of the popular ensemble modeling techniques in machine learning.
 Various weak learners are ensembled in a parallel manner in such a way that, by combining them with a meta-learner, we can make better predictions for the future.
 This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta-learner so that a better output prediction model can be achieved.
 In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to best combine the input predictions to make a better output prediction.
 Stacking is also known as stacked generalization and is an extended form of the Model Averaging Ensemble technique, in which the sub-models participate according to their performance weights to build a new model with better predictions. This new model is stacked up on top of the others; this is the reason why it is named stacking.


 Original data: This data is divided into n folds and serves as the training and test data.
 Base models: These models are also referred to as level-0 models. They use the training data and provide their predictions (level-0) as output.
 Level-0 Predictions: Each base model is trained on some training data and provides different predictions, which are known as level-0 predictions.
 Meta Model: The architecture of the stacking model consists of one meta-model, which helps to best combine the predictions of the base models. The meta-model is also known as the level-1 model.
 Level-1 Prediction: The meta-model learns how to best combine the predictions of the base models. It is trained on predictions made by the individual base models on data not used to train them; these predictions, along with the expected outputs, form the input-output pairs of the dataset used to fit the meta-model.

Steps to implement Stacking models:


1. Split the training data into n folds using RepeatedStratifiedKFold, as this is the most common approach to preparing training datasets for meta-models.
2. Now a base model is fitted on the first n-1 folds and makes predictions for the nth fold.
3. The predictions made in the above step are added to the x1_train list.
4. Repeat steps 2 & 3 for the remaining folds, so that x1_train covers all n parts.
5. Now the model is trained on all n parts and makes predictions for the sample (test) data.
6. Add these predictions to the y1_test list.
7. In the same way, find x2_train, y2_test, x3_train, and y3_test by using Model 2 and Model 3 for training, respectively.
8. Now train the meta-model on these out-of-fold predictions, where they are used as features for the model.
Finally, the meta-learner can be used to make predictions on test data in the stacking model; a compact scikit-learn version of this procedure is sketched below.
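The sketch below assumes scikit-learn's StackingClassifier, which performs the cross-validated generation of level-0 predictions internally, so it is a compact stand-in for the manual fold-by-fold procedure above. The base models, meta-model, and toy dataset are illustrative assumptions.

```python
# A minimal stacking sketch with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

level0 = [
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier()),
]
meta = LogisticRegression(max_iter=1000)   # level-1 (meta) model

# Out-of-fold predictions of the base models are generated with this splitter
# and used as the features on which the meta-model is trained.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
stack = StackingClassifier(estimators=level0, final_estimator=meta, cv=cv)
stack.fit(X, y)
print("training accuracy:", stack.score(X, y))
```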

TOPIC 21. UNSUPERVISED LEARNING: K-MEANS


K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.

What is K-Means Algorithm?


 K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters.
 Here K defines the number of pre-defined clusters that need to be created in the process; if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
 It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
 It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
 It is a centroid-based algorithm, where each cluster is associated with a centroid.
 The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for the K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.
 Hence each cluster has data points with some commonalities and is away from other clusters.


How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurred, then go to step-4, else go to FINISH.

Step-7: The model is ready.
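A minimal sketch of these steps, assuming scikit-learn's KMeans; the two-dimensional toy data standing in for the M1/M2 scatter plot are an assumption for illustration.

```python
# A minimal K-means sketch with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of points standing in for the M1/M2 scatter plot.
X = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),
               rng.normal([7, 7], 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # Step 1: choose K
labels = kmeans.fit_predict(X)                             # Steps 2-6: iterate until stable

print("cluster centroids:\n", kmeans.cluster_centers_)
print("first five assignments:", labels[:5])
```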

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

 Let's take the number k of clusters, i.e., K=2, to identify the dataset and put the points into different clusters. It means here we will try to group these data points into two different clusters.
 We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points.
 So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:

 Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate
the distance between two points. So, we will draw a median between both the
centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.

 As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster and place the new centroids there, as below:

Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median line will be as in the below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points. We will repeat the process by finding the center of gravity


of the centroids, so the new centroids will be as shown in the below image:

 As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

 We can see in the above image that there are no data points on the wrong side of the line, which means our model is formed. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:


K Means Clustering working

Refer handwritten notes for solved example.

Limitations of K-Means clustering


1. When the number of data points is small, the initial grouping will determine the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters; using the same data, if it is input in a different order it may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions. Different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.

Applications
 Customer Segmentation: K-means clustering in machine learning permits marketers
to enhance their customer base, work on a target base, and segment customers based
on purchase patterns, interests, or activity monitoring. Segmentation helps companies
target specific clusters/groups of customers for particular campaigns.
 Document Classification: Cluster documents into multiple categories based on tags, topics, and content. K-means clustering in machine learning is a suitable algorithm for this purpose. Initial processing of the documents is needed to represent each document as a vector, using term frequency to identify commonly used terms that help classify the document. The document vectors are then clustered to help identify similarities in document groups.
 Delivery store optimization: K-means clustering in machine learning helps to optimize the process of goods delivery using truck drones, and helps to find the optimal number of launch locations.
 Insurance fraud detection: Machine learning is critical in fraud detection and has
numerous applications in automobile, healthcare, and insurance fraud detection.
 K-Means clustering in machine learning can also be used for performing image
segmentation by trying to group similar pixels in the image together and creating
clusters.


TOPIC 22. INSTANCE BASED LEARNING: KNN

One more way to categorize Machine Learning systems is by how they generalize.
Generalization — usually refers to a ML model’s ability to perform well on new unseen data
rather than just the data that it was trained on.

Most Machine Learning tasks are about making predictions. This means that given a number
of training examples, the system needs to be able to make good “predictions for” / “generalize
to” examples it has never seen before. Having a good performance measure on the training
data is good, but insufficient; the true goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-
based learning

1. Instance-based learning:
(sometimes called memory-based learning) is a family of learning algorithms that, instead of
performing explicit generalization, compares new problem instances with instances seen in
training, which have been stored in memory.

Instance-based learning systems, also known as lazy learning systems, store the entire training dataset in memory; when a new instance is to be classified, they compare the new instance with the stored instances and predict based on the most similar ones. These systems do not build a model from the training dataset.

Examples of instance-based learning algorithms


K-Nearest Neighbors – This algorithm classifies a data point based on the majority class of
its k nearest neighbors.
Locally Weighted Regression: This algorithm is used for regression problems and creates a
linear model around the test data point by giving more weight to the training instances that
are closer to the test instance.

2. Model-based learning:
 Model-based learning uses machine learning models that are parameterized with a fixed number of parameters that does not change as the size of the training data changes.

 If you do not assume any distribution with a fixed number of parameters over your data (for example, in k-nearest neighbors or in a decision tree, where the number of parameters grows with the size of the training data), then you are not model-based, i.e., you are nonparametric.

 Model-based learning systems are also known as eager learning systems, where the model learns from the training data. These systems build a machine learning model using the entire training dataset, by analyzing the training data and identifying patterns and relationships. After that, the model can be used to make predictions on new data.

Examples of model-based learning algorithms


Linear Regression: This algorithm is used for predicting continuous variables and assumes
that the relationship between the input and output variables is linear.
Logistic Regression: This algorithm is used for predicting binary outcomes and is based on
the logistic function.
Decision Trees: This algorithm is used for both classification and regression problems and is
based on a tree-like structure where each internal node represents a feature, and each leaf
node represents a class label or a predicted value.

Which is better?
 In general, it’s better to choose model-based learning when the goal is to make
predictions on unseen data, and when there are enough computational resources
available. And it’s better to choose instance-based learning when the goal is to make
predictions on new data that are similar to the training instances, and when there are
limited computational resources available.

 Also, instance-based learning systems have a lower training time compared to


model-based learning systems, but a higher prediction time. Model-based learning
systems have a higher training time compared to instance-based learning systems, but
a lower prediction time.

 So, both instance-based and model-based learning have their own advantages and
disadvantages, and the choice between them depends on the specific problem and the
available resources.

KNN (K-Nearest Neighbour)

 K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
 The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
 The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
 Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data that are similar to the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.

Why do we need a K-NN Algorithm?


 Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie?
 To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
 Consider the below diagram

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors


Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.


Suppose we have a new data point and we need to put it in the required category. Consider the below image:

 Firstly, we will choose the number of neighbors, so we will choose k=5.
 Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as d = √((x₂-x₁)² + (y₂-y₁)²).
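A minimal K-NN sketch following these steps, assuming scikit-learn's KNeighborsClassifier and a made-up two-feature training set:

```python
# A minimal K-NN sketch: the classifier stores the training data (lazy learning)
# and predicts the majority class of the 5 nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[2, 4], [4, 6], [4, 2], [6, 4], [6, 6], [8, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # Step 1: choose K
knn.fit(X_train, y_train)                                      # lazy: just stores the data

new_point = np.array([[5, 5]])
print("predicted category:", knn.predict(new_point))           # majority of 5 nearest neighbours
```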

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:

 There is no particular way to determine the best value for K, so we need to try some values to find the best of them. The most preferred value for K is 5.
 A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
 Large values for K are good, but too large a value may cause difficulties, since distant, less relevant points start to influence the prediction.

Advantages of KNN Algorithm:


 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


 Always needs to determine the value of K, which may be complex at times.
 The computation cost is high because of calculating the distance between the new data point and all the training samples.

Eg 1: We have data from a questionnaire survey and objective testing with two attributes (acid durability and strength) to classify whether a special paper tissue is good or not. Here are the training samples.

Now, the factory produces a new paper tissue that passes the lab test with X1 = 3 and X2 = 7. Without another expensive survey, find the classification of this new tissue.

1. Determine K, e.g., K = 3
2. Calculate the distance between the query instance and all the training samples. Here the query instance is (3, 7)

X1 = Acid Durability (seconds) | X2 = Strength (kg/sq.m) | Squared distance to (3, 7)
7 | 7 | (7-3)² + (7-7)² = 16
7 | 4 | (7-3)² + (4-7)² = 25
3 | 4 | (3-3)² + (4-7)² = 9
1 | 4 | (1-3)² + (4-7)² = 13
3. Sort the distance and determine nearest neighbors based on the k-th minimum

X1 | X2 | Squared distance to (3, 7) | Rank (by minimum distance) | Included in 3 nearest neighbors?
7 | 7 | (7-3)² + (7-7)² = 16 | 3 | Yes
7 | 4 | (7-3)² + (4-7)² = 25 | 4 | No
3 | 4 | (3-3)² + (4-7)² = 9 | 1 | Yes
1 | 4 | (1-3)² + (4-7)² = 13 | 2 | Yes

4. Gather the category Y of Nearest Neighbors


X1 | X2 | Squared distance to (3, 7) | Rank (by minimum distance) | Included in 3 nearest neighbors? | Y
7 | 7 | (7-3)² + (7-7)² = 16 | 3 | Yes | Bad
7 | 4 | (7-3)² + (4-7)² = 25 | 4 | No | -
3 | 4 | (3-3)² + (4-7)² = 9 | 1 | Yes | Good
1 | 4 | (1-3)² + (4-7)² = 13 | 2 | Yes | Good

Use the simple majority of the categories to predict the class of the instance. Here Y = Good (2 of the 3 votes).
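The same worked example in a few lines of Python (a sketch; the label of the excluded row (7, 4) is assumed to be Bad, since the table above leaves its Y blank):

```python
# Reproduces the worked example: squared distances to the query (3, 7),
# the 3 nearest neighbours, and the majority class.
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query = (3, 7)
k = 3

# Squared Euclidean distance to the query for every training sample.
dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label) for (x1, x2), label in train]
dists.sort()                                  # rank by distance
neighbours = dists[:k]                        # keep the k nearest

votes = Counter(label for _, label in neighbours)
print(neighbours)                             # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]
print("prediction:", votes.most_common(1)[0][0])   # Good
```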

Eg 2:
Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only the height and weight information we have. Data including height, weight and T-shirt size information is shown. A new customer named 'Monica' has a height of 161 cm and a weight of 61 kg. Find her T-shirt size.


Eg:3

The table above represents our data set. We have two columns — Brightness and Saturation.
Each row in the table has a class of either Red or Blue.

Let's assume the value of K is 5.

How to Calculate Euclidean Distance in the K-Nearest Neighbors Algorithm

We have a new entry but it doesn't have a class yet. To know its class, we have to calculate the
distance from the new entry to other entries in the data set using the Euclidean distance
formula.

Here's the formula: d = √((X₂-X₁)² + (Y₂-Y₁)²)

Where:

X₂ = New entry's brightness (20).


X₁= Existing entry's brightness.
Y₂ = New entry's saturation (35).
Y₁ = Existing entry's saturation.
Let's do the calculation together. I'll calculate the first three.

Distance #1
For the first row, d1:

d1 = √((20 - 40)² + (35 - 20)²)
   = √(400 + 225)
   = √625
   = 25

We now know the distance from the new data entry to the first entry in the table. Let's update
the table.

Distance #2
For the second row, d2:

d2 = √((20 - 50)² + (35 - 50)²)
   = √(900 + 225)
   = √1125
   ≈ 33.54

Here's the table with the updated distance:


Table will look like after all the distances have been calculated:

Let's rearrange the distances in ascending order:

Since we chose 5 as the value of K, we'll only consider the first five rows. That is:

As you can see above, the majority class within the 5 nearest neighbors to the new entry is
Red. Therefore, we'll classify the new entry as Red.

Here's the updated table:

How to Choose the Value of K in the K-NN Algorithm

 Choosing a very low value will most likely lead to inaccurate predictions.
 The commonly used value of K is 5.
 Always use an odd number as the value of K.

TOPIC 23. GAUSSIAN MIXTURE MODELS AND EXPECTATION MAXIMIZATION

 Gaussian mixture models (GMMs) are a type of machine learning algorithm. They are used to classify data into different categories based on the probability distribution.
 Gaussian mixture models can be used in many different areas, including finance, marketing, and many other fields.

 Gaussian mixture models (GMM) are a probabilistic concept used to model real-world data
sets. GMMs are a generalization of Gaussian distributions and can be used to represent
any data set that can be clustered into multiple Gaussian distributions.
 The Gaussian mixture model is a probabilistic model that assumes all the data points
are generated from a mix of Gaussian distributions with unknown parameters.
 A Gaussian mixture model can be used for clustering, which is the task of grouping a set
of data points into clusters.
 GMMs can be used to find clusters in data sets where the clusters may not be clearly
defined. Additionally, GMMs can be used to estimate the probability that a new data point
belongs to each cluster.
 Gaussian mixture models are also relatively robust to outliers, meaning that they can
still yield accurate results even if there are some data points that do not fit neatly into any
of the clusters. This makes GMMs a flexible and powerful tool for clustering data.
 It can be understood as a probabilistic model where a Gaussian distribution is assumed for each group, and each group has a mean and covariance that define its parameters.
 A GMM consists of two kinds of parameters: mean vectors (μ) and covariance matrices (Σ). A Gaussian distribution is defined as a continuous probability distribution that takes on a bell-shaped curve.
 Another name for the Gaussian distribution is the normal distribution. Here is a picture of Gaussian mixture models:

 GMM has many applications, such as density estimation, clustering, and image
segmentation.
 For density estimation, GMM can be used to estimate the probability density function of a
set of data points.
 For clustering, GMM can be used to group together data points that come from the same
Gaussian distribution. And for image segmentation, GMM can be used to partition an
image into different regions.
 Gaussian mixture models can be used for a variety of use cases, including identifying
customer segments, detecting fraudulent activity, and clustering images.
 In each of these examples, the Gaussian mixture model is able to identify clusters in the
data that may not be immediately obvious.
 As a result, Gaussian mixture models are a powerful tool for data analysis and should be
considered for any clustering task.
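A minimal clustering sketch, assuming scikit-learn's GaussianMixture; the two-component toy data and hyperparameters are illustrative assumptions.

```python
# A minimal GMM clustering sketch with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                                    # parameters estimated with EM

print("means:\n", gmm.means_)                 # mean vector of each Gaussian
print("soft assignment of first point:", gmm.predict_proba(X[:1]))  # membership probabilities
print("hard cluster labels:", gmm.predict(X[:5]))
```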

What is the expectation-maximization (EM) method in relation to GMM?


 In Gaussian mixture models, the expectation-maximization (EM) method is a powerful tool for estimating the parameters of a Gaussian mixture model (GMM). The expectation step is termed E and the maximization step is termed M.
 The expectation (E) step computes, for each data point, the probability that it belongs to each Gaussian component given the current parameters. The maximization (M) step then re-estimates the parameters of each component from these probabilities.
 The expectation-maximization method is a two-step iterative algorithm that alternates between an expectation step, in which we compute the expected component memberships of each data point using the current parameter estimates, and a maximization step, in which we update the Gaussian means, covariances, and mixing weights based on the maximum likelihood estimates.
 The EM method works by first initializing the parameters of the GMM, then iteratively improving these estimates. At each iteration, the expectation step calculates the expectation of the log-likelihood function with respect to the current parameters. This expectation is then maximized with respect to the parameters in the maximization step. The process is then repeated until convergence.

 Here is a picture representing the two-step iterative aspect of the algorithm; a small NumPy sketch of the same loop is given below.
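The following is a compact NumPy sketch of the two EM steps for a one-dimensional, two-component GMM. It is a simplified illustration of the E/M alternation described above, not a production implementation; the synthetic data and initial values are assumptions.

```python
# A minimal EM loop for a 1-D, two-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])

# Initialise parameters: mixing weights, means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E step: responsibilities = posterior probability of each component per point.
    resp = pi * gaussian(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)

    # M step: re-estimate weights, means and variances from the responsibilities.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("estimated means:", mu)       # should approach the true means (about 0 and 5)
print("estimated weights:", pi)
```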

The following are the main steps in using Gaussian mixture models:

 Determining a covariance matrix that defines how each Gaussian is related to the others. The more similar two Gaussians are, the closer their means will be, and vice versa if they are far from each other in terms of similarity. A Gaussian mixture model can have a covariance matrix that is diagonal or full.
 Determining the number of Gaussians in each group, which defines how many clusters there are.
 Selecting the hyperparameters that define how to optimally separate the data using Gaussian mixture models, as well as deciding whether each Gaussian's covariance matrix is diagonal or full.

Application areas
There are many different real-world problems that can be solved with gaussian mixture
models. Gaussian mixture models are very useful when there are large datasets and it is
difficult to find clusters. This is where Gaussian mixture models help. It is able to find
clusters of Gaussians more efficiently than other clustering algorithms such as k-means.

Here are some real-world problems which can be solved using Gaussian mixture models:

 Finding patterns in medical datasets: GMMs can be used for segmenting images into
multiple categories based on their content or finding specific patterns in medical datasets.
They can be used to find clusters of patients with similar symptoms, identify disease
subtypes, and even predict outcomes. In one recent study, a Gaussian mixture model was
used to analyze a dataset of over 700,000 patient records. The model was able to identify
previously unknown patterns in the data, which could lead to better treatment for patients
with cancer.
 Modeling natural phenomena: GMM can be used to model natural phenomena where it
has been found that noise follows Gaussian distributions. This model of probabilistic
modeling relies on the assumption that there exists some underlying continuum of
unobserved entities or attributes and that each member is associated with measurements
taken at equidistant points in multiple observation sessions.
 Customer behavior analysis: GMMs can be used for performing customer behavior
analysis in marketing to make predictions about future purchases based on historical data.
 Stock price prediction: Another area Gaussian mixture models are used is in finance
where they can be applied to a stock’s price time series. GMMs can be used to detect
changepoints in time series data and help find turning points of stock prices or other
market movements that are otherwise difficult to spot due to volatility and noise.
 Gene expression data analysis: Gaussian mixture models can be used for gene
expression data analysis. In particular, GMMs can be used to detect differentially

expressed genes between two conditions and identify which genes might contribute
toward a certain phenotype or disease state.

What are the differences between Gaussian mixture models and other types of
clustering algorithms such as K-means?
Here are some of the key differences between Gaussian mixture models and the K-means
algorithm used for clustering:
 A Gaussian mixture model is a type of clustering algorithm that assumes that the data
point is generated from a mixture of Gaussian distributions with unknown parameters.
The goal of the algorithm is to estimate the parameters of the Gaussian distributions, as
well as the proportion of data points that come from each distribution. In contrast, K-
means is a clustering algorithm that does not make any assumptions about the underlying
distribution of the data points. Instead, it simply partitions the data points into K clusters,
where each cluster is defined by its centroid.
 While Gaussian mixture models are more flexible, they can be more difficult to train than
K-means. K-means is typically faster to converge and so may be preferred in cases where
the runtime is an important consideration.
 In general, K-means will be faster and more accurate when the data set is large and the
clusters are well-separated. Gaussian mixture models will be more accurate when the data
set is small or the clusters are not well-separated.
 Gaussian mixture models take into account the variance of the data, whereas K-means
does not.
 Gaussian mixture models are more flexible in terms of the shape of the clusters, whereas
K-means is limited to spherical clusters.
 Gaussian mixture models can handle missing data, whereas K-means cannot. This
difference can make Gaussian mixture models more effective in certain applications, such
as data with a lot of noise or data that is not well-defined.
