2. Nearest Neighbour
K-Nearest Neighbour (KNN) is based on supervised learning. It classifies a new point by identifying the closest stored examples using a distance measure such as Euclidean distance.
3. Decision Trees
A decision tree is a tree in which each branch node represents a choice among a number of alternatives. Data is represented in the form of a tree in order to predict outcomes for unseen circumstances.
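As a sketch, the branch nodes of a decision tree can be written as nested conditions; the feature names and thresholds below are invented purely for illustration:

```python
# A hand-coded decision tree: each `if` is a branch node choosing
# between alternatives; the return values are the leaf class labels.
# Feature names and thresholds are made up for illustration.
def classify_fruit(weight_g, colour):
    if weight_g > 150:          # branch node 1: split on weight
        if colour == "green":   # branch node 2: split on colour
            return "apple"
        return "orange"
    return "plum"

print(classify_fruit(180, "green"))   # apple
print(classify_fruit(120, "red"))     # plum
```

A learning algorithm such as ID3 or CART builds this structure automatically by choosing the most informative split at each node.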
4. Naïve Bayes
Naïve Bayes is a widely used method. Given a particular dataset, the Naïve Bayes algorithm is generally used to construct classifiers: models that assign class labels to instances, under the "naïve" assumption that the features are independent given the class.
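A minimal Naïve Bayes sketch for categorical features follows; the tiny weather dataset is invented for illustration. The score for a class is P(class) multiplied by P(feature value | class) for each feature:

```python
from collections import Counter, defaultdict

# Toy training data: (outlook, windy) -> play? (invented for illustration)
data = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
]

priors = Counter(label for _, label in data)        # class counts
likelihood = defaultdict(Counter)                   # (feature idx, class) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        likelihood[(i, label)][value] += 1

def predict(features):
    """Return the class label with the highest Naive Bayes score."""
    best, best_score = None, -1.0
    for label, count in priors.items():
        score = count / len(data)                   # prior P(class)
        for i, value in enumerate(features):
            score *= likelihood[(i, label)][value] / count   # P(value | class)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("sunny", "no")))   # yes
```

A production implementation would add smoothing so that unseen feature values do not zero out the whole score.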
Linear models are also available for classification.
• Linear Models
Among linear models, linear regression is very popular in statistics and machine learning; it assumes a linear relationship between the input variable and the output variable. Linear regression is a statistical regression method used for predictive analysis.
Logistic regression is another supervised learning algorithm, used to solve classification problems. It is applied to binary classification, where each example in the dataset is labelled 0 or 1.
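The linear relationship can be fitted with ordinary least squares. A minimal pure-Python sketch (the toy data points are invented, chosen to lie exactly on a line):

```python
# Ordinary least-squares fit of y = a + b*x.
# slope b = covariance(x, y) / variance(x); intercept a = mean(y) - b*mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Points lying exactly on y = 1 + 2x recover the true coefficients.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)   # 1.0 2.0
```

Logistic regression differs only in passing the linear combination through a sigmoid and fitting by maximum likelihood rather than least squares.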
1. Similarity
Similarity is a machine learning method that uses a nearest-neighbour approach to identify how similar two or more objects are to each other, based on algorithmic distance functions. For similarity to operate at the speed and scale that machine learning demands, two critical capabilities are required: high-speed indexing, and metric and non-metric distance functions.
Metric and non-metric distance functions can provide significant improvements in accuracy over traditional Euclidean distance.
A similarity measure is a distance whose dimensions represent the features of the objects. Similarity between two objects x and y is typically normalised so that it equals 1 when x = y and approaches 0 as the objects become dissimilar.
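A minimal sketch of a distance-based similarity score, converting Euclidean distance into the (0, 1] range; the conversion 1 / (1 + d) is one common choice, not the only one:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(x, y):
    """Maps distance 0 -> similarity 1; larger distances -> values near 0."""
    return 1.0 / (1.0 + euclidean(x, y))

print(similarity([1, 2], [1, 2]))   # 1.0 (identical objects, x = y)
print(similarity([0, 0], [3, 4]))   # distance 5 -> similarity ~0.167
```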
• Distance-based methods are machine learning algorithms that classify queries by computing the distance between the query and a number of stored exemplars. Exemplars are the stored examples closest to the query, and they have the largest influence on the classification assigned to the query.
Euclidean space is the fundamental space of classical geometry. Originally, it was the three-
dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean
spaces of any nonnegative integer dimension, including the three-dimensional space and
the Euclidean plane (dimension two).
In the context of machine learning, the concept of distance is not based merely on the physical distance between two points. Instead, we can think of the distance between two points in terms of the mode of transport between them. Travelling between two cities by plane covers less physical distance than by train, because a plane's path is unrestricted.
Similarly, in chess, the concept of distance depends on the piece used – for example, a
Bishop can move diagonally.
Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently.
The distance metrics commonly used are Euclidean, Minkowski, and Manhattan. Distance is applied through the concept of neighbours and exemplars. Neighbours are points in proximity with respect to the distance measure, expressed through exemplars. Exemplars are either
centroids that find a centre of mass according to a chosen distance metric or medoids that find
the most centrally located data point. The most commonly used centroid is the arithmetic
mean, which minimises squared Euclidean distance to all other points.
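As a quick sketch, the arithmetic-mean centroid and its squared-distance-minimising property can be checked numerically (the points are invented):

```python
def centroid(points):
    """Arithmetic mean of a set of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def total_sq_dist(c, points):
    """Total squared Euclidean distance from candidate centre c to all points."""
    return sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for p in points)

pts = [(0, 0), (2, 0), (0, 2), (2, 2)]
c = centroid(pts)
print(c)   # (1.0, 1.0)
# Any other candidate centre gives a larger total squared distance.
print(total_sq_dist(c, pts) < total_sq_dist((1.5, 1.0), pts))   # True
```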
Notes:
• The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of all the points in the figure. This definition extends to any object in n-dimensional space: its centroid is the mean position of all its points.
• Medoids are similar in concept to means or centroids. Medoids are most commonly used on
data when a mean or centroid cannot be defined. They are used in contexts where the centroid
is not representative of the dataset, such as in image data. Examples of distance-based models
include the nearest-neighbour models, which use the training data as exemplars – for
example, in classification. The K-means clustering algorithm also uses exemplars to create
clusters of similar data points.
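The medoid can be sketched as the data point minimising the total distance to all other points (Euclidean distance here; the points are invented):

```python
import math

def medoid(points):
    """Return the data point with the smallest total distance to all others."""
    def total_dist(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_dist)

# Unlike a centroid, the medoid is always one of the original points,
# so an outlier such as (10, 10) does not drag it off the data.
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(medoid(pts))   # (1, 1) -- the most centrally located point
```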
K-Nearest Neighbour (KNN) Algorithm for Machine Learning
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories does this data point belong to? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
How does K-NN work?
The working of K-NN can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
o Firstly, we will choose the number of neighbours; here we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. For
points (x1, y1) and (x2, y2) it can be calculated as:
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distance, we get the nearest neighbours: three nearest
neighbours in Category A and two nearest neighbours in Category B. Consider the below
image:
o As three of the five nearest neighbours are from Category A, this new data point must
belong to Category A.
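The steps above can be sketched as a small from-scratch K-NN classifier; the two-category training points below are invented for illustration:

```python
import math
from collections import Counter

def knn_classify(query, examples, k=5):
    """examples: list of ((x, y), label). Returns the majority label
    among the k training points nearest to the query."""
    neighbours = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
# The query sits near the Category A cluster, so three of its
# five nearest neighbours vote for A.
print(knn_classify((2, 2), train, k=5))   # A
```

Sorting all distances is O(n log n) per query; real implementations speed this up with indexing structures such as k-d trees.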
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values
to find the best out of them. The most preferred value for K is 5.
o A very low value of K, such as K=1 or K=2, can be noisy and make the model sensitive to
outliers.
o Large values for K are good, but too large K might include majority points from other classes.
o A rule of thumb is K < sqrt(n), where n is the number of training examples.
Advantages of K-NN:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective when the training data is large.
Disadvantages of K-NN:
o The value of K always needs to be determined, which may be complex at times.
o Classifying a new example takes more time, since the distance from the new example to all
stored examples must be calculated and compared.
o It needs a large number of samples for accuracy.
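The K < sqrt(n) rule of thumb can be sketched as a small helper; forcing K to be odd (to avoid tied votes in binary classification) is an extra convention, not part of the rule itself:

```python
import math

def choose_k(n_examples):
    """Pick K at most sqrt(n), forced odd to avoid tied votes."""
    k = max(1, int(math.sqrt(n_examples)))
    if k % 2 == 0:
        k -= 1   # an even K can produce tied votes in binary classification
    return k

print(choose_k(100))   # 9
print(choose_k(25))    # 5
```

In practice, K is usually tuned by cross-validation around this starting value rather than fixed by the formula alone.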