
1. Distance Based Methods

Distance-based methods classify queries by computing the distance between the queries and a set of stored exemplars.

In this section we consider Similarity, Euclidean space, Euclidean distance, and Manhattan distance, and see how they are used.

2. Nearest Neighbour

It is based on supervised learning and is also called KNN. Here we identify the closest neighbours using a distance measure such as Euclidean distance.

3. Decision Trees

A decision tree is a tree in which each branch node represents a choice among a number of alternatives. Data is represented in the form of a tree in order to predict unseen circumstances.
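As a minimal sketch of this idea, the following assumes scikit-learn and its bundled iris toy dataset (both are illustrative choices, not named in these notes):

    # Decision-tree sketch: each branch node encodes a choice among
    # alternatives (a threshold test on one feature).
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)

    # Predict an unseen circumstance (a new flower's four measurements).
    print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))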

4. Naïve Bayes

Naïve Bayes is a widely used method. The Naïve Bayes algorithm is generally used for constructing classifiers: models that assign class labels to instances of a given dataset.
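A minimal sketch of such a classifier, again assuming scikit-learn and a toy dataset:

    # Naive Bayes sketch: the model is fitted on a labelled dataset and
    # then assigns class labels to unseen instances.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    nb = GaussianNB()
    nb.fit(X_train, y_train)                       # model the dataset
    print("accuracy:", nb.score(X_test, y_test))   # label unseen cases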
The following methods are also available for classification (a combined code sketch appears after the list).

• Linear Models

In linear models, linear regression is very popular in statistics and machine learning; it assumes a linear relationship between the input variables and the output variable. Linear regression is a statistical regression method used for predictive analysis.

Logistic regression is another supervised learning algorithm, used to solve classification problems. It is used for binary classification, where the dataset's output labels are 0 or 1.

Generalized linear models extend this family and cover cases such as multiple regression analysis.

• Support Vector Machine is a supervised learning technique used for both classification and regression. It handles nonlinearity through kernel methods.

• Beyond Binary Classification is also known as multiclass classification. Here a problem has more than two possible discrete outcomes, and we must classify each instance into one of those classes.
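A combined sketch of the bulleted ideas (linear models, SVMs, and multiclass classification), assuming scikit-learn and a three-class toy dataset:

    # Linear regression, logistic regression, and a kernel SVM on one dataset.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Linear regression assumes a linear relationship between inputs and a
    # continuous output; here one feature is regressed on the others purely
    # as an illustration.
    lin = LinearRegression().fit(X_train[:, :3], X_train[:, 3])
    print("linear prediction:", lin.predict(X_test[:1, :3]))

    # Logistic regression is binary by design; scikit-learn extends it to
    # more than two classes, which also illustrates "beyond binary"
    # (multiclass) classification.
    log = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("logistic accuracy:", log.score(X_test, y_test))

    # A support vector machine handles nonlinearity through a kernel
    # (RBF here) and can be used for classification (SVC) or regression (SVR).
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    print("svm accuracy:", svm.score(X_test, y_test))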
Distance-Based Machine Learning Methods

1. Similarity
Similarity is a machine learning method that uses a nearest neighbour approach to
identify the similarity of two or more objects to each other based on algorithmic
distance functions. In order for similarity to operate at the speed and scale of machine
learning standards, two critical capabilities are required – high-speed indexing &
metric and non-metric distance functions.
Metric and non-metric distance functions provide significant improvements in accuracy over traditional Euclidean distance.

Similarity is a basic building block for this kind of activity.

It is a measure of how much two objects are alike.

A similarity measure is a distance whose dimensions represent the features of the objects.

Similarity is highly dependent on domain and application.

Similarity is measured between two objects x and y: if the similarity is 1 we can say x = y, while a similarity of 0 means they are entirely dissimilar.
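One simple way to realise this in code is to derive a similarity from a distance, so that identical objects score 1 and the score decays toward 0 as they move apart (the 1/(1 + d) form below is one common convention; as noted above, the right choice is domain dependent):

    import math

    def similarity(x, y):
        d = math.dist(x, y)      # Euclidean distance between feature vectors
        return 1.0 / (1.0 + d)   # 1 when x == y, approaching 0 as d grows

    print(similarity([1, 2], [1, 2]))   # 1.0 -> identical objects
    print(similarity([1, 2], [4, 6]))   # ~0.167 -> less similar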

• Distance-based methods are machine learning algorithms that classify queries by computing the distance between the queries and a number of stored exemplars.

Exemplars are the stored examples closest to the query, and they have the largest influence on the classification assigned to the query.
Euclidean space is the fundamental space of classical geometry. Originally, it was the three-
dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean
spaces of any nonnegative integer dimension, including the three-dimensional space and
the Euclidean plane (dimension two).

Distance-based models are the second class of Geometric models.

As the name implies, distance-based models work on the concept of distance.

In the context of machine learning, the concept of distance is not based merely on the physical distance between two points. Instead, we can think of the distance between two points in terms of the mode of transport between them. Travelling between two cities by plane covers less distance physically than by train, because a plane is unrestricted. Similarly, in chess, the concept of distance depends on the piece used – for example, a Bishop can only move diagonally.

Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently.

The distance metrics commonly used are Euclidean, Minkowski, Manhattan. Distance is
applied through the concept of neighbours and exemplars. Neighbours are points in proximity
with respect to the distance measure expressed through exemplars. Exemplars are either
centroids that find a centre of mass according to a chosen distance metric or medoids that find
the most centrally located data point. The most commonly used centroid is the arithmetic
mean, which minimises squared Euclidean distance to all other points.
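The three metrics named above are straightforward to write down; a sketch for two feature vectors x and y:

    import math

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski(x, y, p):
        # Generalises the other two: p = 1 gives Manhattan, p = 2 Euclidean.
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    x, y = [0, 0], [3, 4]
    print(euclidean(x, y))      # 5.0
    print(manhattan(x, y))      # 7
    print(minkowski(x, y, 2))   # 5.0, matching the Euclidean result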
Notes:

• The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of all the points in the figure. This definition extends to any object in n-dimensional space: its centroid is the mean position of all its points.

• Medoids are similar in concept to means or centroids. Medoids are most commonly used on
data when a mean or centroid cannot be defined. They are used in contexts where the centroid
is not representative of the dataset, such as in image data. Examples of distance-based models
include the nearest-neighbour models, which use the training data as exemplars – for
example, in classification. The K-means clustering algorithm also uses exemplars to create
clusters of similar data points.
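A small sketch contrasting the two kinds of exemplar (assuming NumPy; the data is made up for illustration):

    import numpy as np

    points = np.array([[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]])

    # Centroid: the arithmetic mean position, which minimises squared
    # Euclidean distance to all points; it need not be a data point.
    centroid = points.mean(axis=0)

    # Medoid: the actual data point with the smallest total distance to
    # all the others, used when a mean is undefined or unrepresentative.
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    medoid = points[pairwise.sum(axis=1).argmin()]

    print("centroid:", centroid)   # [4. 4.] -> not one of the points
    print("medoid:  ", medoid)     # [2. 2.] -> an actual data point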
K-Nearest Neighbour (KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o At the training phase the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category (a library-based sketch follows this list).
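The sketch below assumes scikit-learn and a generic toy dataset standing in for the cat/dog image features:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # "Training" only stores the data (KNN is a lazy learner); the real work
    # happens at prediction time, when distances to stored examples are
    # computed and the majority category among the neighbours is returned.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print("accuracy:", knn.score(X_test, y_test))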

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1,
so this data point will lie in which of these categories? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset.
[Diagram: a new data point x1 lying between the Category A and Category B clusters]
How does K-NN work?

The working of K-NN can be explained with the algorithm below (a code sketch follows the steps):

o Step-1: Select the number K of neighbours.
o Step-2: Calculate the Euclidean distance from the new data point to the stored data points.
o Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
o Step-4: Among these K neighbours, count the number of the data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
o Step-6: Our model is ready.
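A from-scratch sketch of these six steps (the function name and data layout are illustrative; train is a list of (feature_vector, label) pairs):

    import math
    from collections import Counter

    def knn_classify(train, query, k=5):            # Step-1: choose K
        # Step-2: Euclidean distance from the query to every stored example.
        dists = [(math.dist(x, query), label) for x, label in train]
        # Step-3: take the K nearest neighbours.
        nearest = sorted(dists)[:k]
        # Step-4/5: count the labels among them; return the majority category.
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]           # Step-6: model is ready

    train = [([1, 1], "A"), ([1, 2], "A"), ([2, 1], "A"),
             ([8, 8], "B"), ([9, 8], "B")]
    print(knn_classify(train, query=[2, 2], k=3))   # prints "A"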

Suppose we have a new data point that we need to put into the required category. Working through the steps:

o Firstly, we will choose the number of neighbours; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance between two points (x1, y1) and (x2, y2), familiar from geometry, can be calculated as d = sqrt((x2 - x1)^2 + (y2 - y1)^2).

o By calculating the Euclidean distances we obtain the nearest neighbours: three nearest neighbours in category A and two in category B.

o As 3 of the 5 nearest neighbours are from category A, the new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some values
to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K are good, but too large K might include majority points from other classes.
o A rule of thumb is K < sqrt(n), where n is the number of examples (see the selection sketch below).
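A sketch of that trial-and-error selection, here via 5-fold cross-validation with scikit-learn (an assumed approach, not the only way to compare K values):

    import math
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    max_k = int(math.sqrt(len(X)))   # rule of thumb: keep K below sqrt(n)

    # Score each candidate K and keep the best-performing one.
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                           X, y, cv=5).mean()
        for k in range(1, max_k + 1)
    }
    best_k = max(scores, key=scores.get)
    print("best K:", best_k, "accuracy:", round(scores[best_k], 3))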

Advantages of KNN Algorithm:

o It is simple to implement.
o It is robust to the noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which can sometimes be complex.
o Takes more time to classify a new example (need to calculate and compare distance from new
example to all other examples).
o Needs a large number of samples for accuracy.
