
The Islamia University of Bahawalpur

Department: BS IT 7th Semester Morning (A)

Assignment: • EM Algorithm
• K-Nearest Neighbors (KNN)

Subject: Machine Learning (INFT-4028)

Submitted to: Dr. Qurat ul Ain

Submitted by: Group # 05


Areeba Samar 1003
Khadija Sajid 1009
Amna Mustafa 1015
Zoha Azmat 1021
Muhammad Awais 1043
Muhammad Hamza Zahid 1071

EM Algorithm

In machine learning, the Expectation-Maximization (EM) algorithm is a fundamental technique used for unsupervised learning, particularly in situations where the data has missing or hidden variables. Here's a more detailed presentation of the EM algorithm in the context of machine learning:

1. Problem Setting:
Incomplete Data: We have a dataset with missing or hidden variables. The goal is to estimate
the parameters of a statistical model that best describes the observed data.
Model Assumption: We assume a probabilistic model that describes how the observed data is
generated from the hidden variables and model parameters.

2. E-step (Expectation Step):
Compute Expected Values: Given the current estimates of the model parameters, compute the expected values of the missing or latent variables using the observed data.
Use Conditional Probabilities: Utilize conditional probabilities to estimate the distribution of the missing or latent variables given the observed data and current parameter estimates.
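
The snippet below is a minimal sketch of this E-step for a two-component, one-dimensional Gaussian mixture; the data values and current parameter estimates are illustrative assumptions, and NumPy and SciPy are assumed to be available.

    import numpy as np
    from scipy.stats import norm

    # Illustrative observed data and current parameter estimates (assumed values)
    x = np.array([0.2, 0.5, 4.1, 4.3, 0.1, 3.9])
    weights, means, stds = np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0])

    # Responsibility r[i, k] = P(component k | x_i, current parameters),
    # i.e. the expected value of the hidden cluster assignment for each point
    lik = np.vstack([w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds)]).T
    resp = lik / lik.sum(axis=1, keepdims=True)
    print(resp)  # each row sums to 1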

3. M-step (Maximization Step):
Update Parameters: Update the model parameters to maximize the likelihood of the observed data, taking into account the expected values of the missing or latent variables computed in the E-step.
Optimization Techniques: This step often involves solving optimization problems to find the parameter values that maximize the likelihood function, typically using techniques like gradient ascent.
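
Continuing the same toy Gaussian-mixture example, the M-step has closed-form updates for the mixing weights, means, and variances given the responsibilities from the E-step; the numbers below are illustrative assumptions.

    import numpy as np

    # Illustrative data and responsibilities produced by an E-step (assumed values)
    x = np.array([0.2, 0.5, 4.1, 4.3, 0.1, 3.9])
    resp = np.array([[0.99, 0.01], [0.98, 0.02], [0.03, 0.97],
                     [0.02, 0.98], [0.99, 0.01], [0.04, 0.96]])

    Nk = resp.sum(axis=0)                                 # effective number of points per component
    weights = Nk / len(x)                                 # updated mixing weights
    means = (resp * x[:, None]).sum(axis=0) / Nk          # updated component means
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk  # updated variances
    print(weights, means, variances)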

4. Iterative Process:
Iterate: Alternate between the E-step and M-step until the algorithm converges, i.e., until the
parameter estimates stabilize and no longer change significantly between iterations.
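
Putting the two steps together, here is a minimal sketch of the full EM loop for a two-component, one-dimensional Gaussian mixture, stopping once the log-likelihood stops improving; the data and starting values are illustrative assumptions.

    import numpy as np
    from scipy.stats import norm

    x = np.array([0.2, 0.5, 4.1, 4.3, 0.1, 3.9])          # illustrative data
    weights, means, stds = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])

    prev_ll = -np.inf
    for it in range(100):
        # E-step: responsibilities under the current parameters
        lik = np.vstack([w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds)]).T
        resp = lik / lik.sum(axis=1, keepdims=True)

        # M-step: closed-form parameter updates
        Nk = resp.sum(axis=0)
        weights = Nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / Nk
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk)

        # Convergence check: stop once the log-likelihood barely changes
        ll = np.log(lik.sum(axis=1)).sum()
        if abs(ll - prev_ll) < 1e-6:
            break
        prev_ll = ll

    print(it, weights, means, stds)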

5. Applications in Machine Learning:
Clustering: The EM algorithm is commonly used in clustering methods such as Gaussian Mixture Models (GMMs), where each cluster is modeled as a Gaussian distribution and EM estimates both the parameters of the Gaussians and the assignment of data points to clusters.
Latent Variable Models: The EM algorithm is also used in latent variable models such as Hidden Markov Models (HMMs) and Latent Dirichlet Allocation (LDA), where hidden variables govern the generation of the observed data and EM is used to infer the hidden variables and estimate the model parameters.
Missing Data Imputation: The EM algorithm can be used to impute missing values in datasets by modeling the data-generation process and estimating the missing values from the observed data.
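
As a sketch of the GMM clustering use case, the example below fits a two-component Gaussian Mixture Model with scikit-learn, whose fit() method runs EM internally; the synthetic data is an illustrative assumption.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Two illustrative, well-separated 2-D blobs (assumed synthetic data)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM runs inside fit()
    labels = gmm.predict(X)        # hard cluster assignment for each point
    probs = gmm.predict_proba(X)   # soft assignments (the E-step responsibilities)
    print(gmm.means_)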

6. Considerations:
Initialization: The performance of the EM algorithm can be sensitive to the initial parameter
values, and it is often beneficial to use multiple initializations or more sophisticated techniques
like EM with random restarts to avoid local optima.
Convergence: While the EM algorithm is guaranteed to converge to a local maximum of the
likelihood function, it may not always find the global maximum, so careful initialization and
multiple runs may be required to achieve good results.
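
A small sketch of the multiple-initialization advice above, using scikit-learn's n_init parameter to run EM from several random starts and keep the best fit; the synthetic data is an illustrative assumption.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2)),
                   rng.normal(8, 1, (100, 2))])              # illustrative data

    single = GaussianMixture(n_components=3, n_init=1, random_state=0).fit(X)
    restarted = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(X)
    # A higher average log-likelihood indicates a better local optimum was found
    print(single.score(X), restarted.score(X))
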
Overall, the EM algorithm provides a powerful framework for estimating the parameters of
complex probabilistic models in machine learning, particularly in situations involving missing or
hidden variables, and it is widely used in various applications like clustering, latent variable
modeling, and missing data imputation.

K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm is a robust and intuitive machine learning method employed to tackle classification and regression problems. By capitalizing on the concept of similarity, KNN predicts the label or value of a new data point by considering its K closest neighbours in the training dataset. In this article, we will learn about KNN, a supervised learning algorithm, highlighting its user-friendly nature.

What is the K-Nearest Neighbors Algorithm?

K-Nearest Neighbours is one of the most basic yet essential classification algorithms in Machine Learning. It belongs to the supervised learning domain and is widely applied in pattern recognition, data mining, and intrusion detection.

It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also called training data), which classifies coordinates into groups identified by an attribute.

Why do we need a KNN algorithm?

The K-NN algorithm is a versatile and widely used machine learning algorithm, valued primarily for its simplicity and ease of implementation. It does not require any assumptions about the underlying data distribution. It can also handle both numerical and categorical data, making it a flexible choice for various types of datasets in classification and regression tasks. It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset. With a suitably large K, K-NN is also relatively insensitive to individual outliers.

The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a
distance metric, such as Euclidean distance. The class or value of the data point is then
determined by the majority vote or average of the K neighbors. This approach allows the
algorithm to adapt to different patterns and make predictions based on the local structure of the
data.
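
The snippet below is a minimal sketch of this idea using scikit-learn's KNeighborsClassifier; the toy points, labels, and the choice K = 3 are illustrative assumptions.

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]]   # illustrative training points
    y_train = ["red", "red", "blue", "blue"]                     # their class labels

    knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
    knn.fit(X_train, y_train)                   # "training" only stores the data
    print(knn.predict([[1.1, 1.0]]))            # majority vote of the 3 nearest -> "red"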

Distance Metrics Used in KNN Algorithm:

As we know, the KNN algorithm helps us identify the nearest points or the groups for a query point. But to determine the closest groups or the nearest points for a query point, we need some metric. For this purpose, we use the distance metrics below:

1. Euclidean Distance:

This is simply the Cartesian distance between two points in the plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line joining the two points under consideration. This metric measures the net displacement between the two states of an object.

2. Manhattan Distance:

The Manhattan distance metric is generally used when we are interested in the total distance traveled by the object rather than its displacement. It is calculated by summing the absolute differences between the coordinates of the points in n dimensions.
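
As a small sketch, the two metrics can be computed as follows (the points p and q are illustrative assumptions):

    import numpy as np

    def euclidean(p, q):
        # Straight-line (L2) distance between two points in n dimensions
        return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

    def manhattan(p, q):
        # Sum of absolute coordinate differences (L1 / city-block distance)
        return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

    p, q = [1, 2, 3], [4, 6, 3]
    print(euclidean(p, q))   # 5.0
    print(manhattan(p, q))   # 7
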
Workings of the KNN Algorithm:

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.

To make predictions, the algorithm calculates the distance between each new data point in the test dataset and all the data points in the training dataset. The Euclidean distance is a commonly used distance metric in K-NN, but other distance metrics, such as Manhattan distance or Minkowski distance, can also be used depending on the problem and data. Once the distances between the new data point and all the data points in the training dataset are calculated, the algorithm proceeds to find the K nearest neighbors based on these distances. The specific method for selecting the nearest neighbors can vary, but a common approach is to sort the distances in ascending order and choose the K data points with the shortest distances.

After identifying the K nearest neighbors, the algorithm makes predictions based on the labels
or values associated with these neighbors. For classification tasks, the majority class among the
K neighbors is assigned as the predicted label for the new data point. For regression tasks, the
average or weighted average of the values of the K neighbors is assigned as the predicted value.

Let X be the training dataset with n data points, where each data point X_i is represented by a d-dimensional feature vector, and let Y be the corresponding labels or values for each data point in X. Given a new data point x, the algorithm calculates the distance between x and each data point X_i in X using a distance metric, such as the Euclidean distance:

d(x, X_i) = sqrt( (x_1 - X_i1)^2 + (x_2 - X_i2)^2 + ... + (x_d - X_id)^2 )
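
Below is a from-scratch sketch of this prediction procedure in the same notation (X for the training points, Y for their labels, x for the query point); the data and the choice K = 3 are illustrative assumptions.

    import numpy as np
    from collections import Counter

    X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                  [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]])   # training points
    Y = np.array(["red", "red", "red", "blue", "blue", "blue"])
    x = np.array([1.1, 1.0])                             # new data point
    K = 3

    # d(x, X_i): Euclidean distance from x to every training point
    distances = np.sqrt(((X - x) ** 2).sum(axis=1))

    # Sort the distances in ascending order and keep the K closest points
    nearest = np.argsort(distances)[:K]

    # Classification: majority vote among the K nearest labels
    predicted = Counter(Y[nearest]).most_common(1)[0][0]
    print(predicted)                                     # -> "red"
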
Pros and Cons:

Pros: Simple, easy to understand, and doesn't make strong assumptions about the underlying data distribution.
Cons: Computationally expensive for large datasets, sensitive to irrelevant features, and may struggle with high-dimensional data.

Normalization:

Scaling features can be important for KNN since it's distance-based. Features with larger scales might dominate the distance calculations.
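
A short sketch of scaling before KNN using scikit-learn's StandardScaler in a pipeline; the toy feature values are illustrative assumptions chosen so that one feature has a much larger scale than the other.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    # Without scaling, the first feature (in the thousands) would dominate every distance
    X_train = [[1200, 0.9], [1300, 1.1], [5200, 5.0], [5100, 4.8]]
    y_train = [0, 0, 1, 1]

    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
    model.fit(X_train, y_train)
    print(model.predict([[1250, 1.0]]))   # -> [0]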
KNN is often used in scenarios where the decision boundaries are complex and not easily
defined by a mathematical formula. It's a non-parametric and instance-based learning
algorithm, meaning it doesn't explicitly learn a model during training; instead, it memorizes the
training data and makes predictions based on similarity in the input space.
