
CLASSIFICATION

K-NEAREST NEIGHBOR & RANDOM FOREST


presented by group 4
WHAT IS KNN?
KNN is a simple and powerful supervised learning algorithm that can be
used for both classification and regression tasks. It is a non-parametric
algorithm, meaning it makes no assumptions about the underlying data
distribution.
KNN CLASSIFICATION CONCEPT
K-nearest neighbors stores all available cases and classifies a new
case based on a similarity measure (a distance function).

It relies on the idea that similar data points tend to have similar
labels or values.

An object (a new instance) is classified by a majority vote of its
neighbors' classes.

The object is then assigned to the most common class among its K
nearest neighbors.
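The "similarity measure (distance function)" above is most often plain Euclidean distance. As a minimal sketch (the slides don't commit to a specific distance):

```python
import math

# Euclidean distance between two feature vectors: the most common
# "similarity measure" used by KNN (smaller distance = more similar).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Other choices (Manhattan, cosine) plug into KNN the same way; only this function changes.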
HOW KNN WORKS?
Let's take a simple case to understand this algorithm. Following is a
spread of red circles (RC) and green squares (GS).

You intend to find out the class of the blue star (BS). BS can either
be RC or GS and nothing else.

The K value in the KNN algorithm is the number of nearest neighbors we
wish to take a vote from. Let's assume K = 3.

We will now make a circle with BS as the center, just big enough to
enclose only three data points on the plane.

The three closest points to BS are all RC. Hence, with a good
confidence level, we can say that BS should belong to the class RC.

Here, the choice became obvious as all three votes from the closest
neighbors went to RC. Thus, the choice of the parameter K is very
crucial in this algorithm.
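The walkthrough above can be sketched from scratch. The slides show the RC/GS spread on a plot without coordinates, so the points below are made-up stand-ins; the vote logic itself is the algorithm as described:

```python
from collections import Counter
import math

# Hypothetical coordinates for the red circles (RC) and green squares (GS);
# the slides draw them on a plane but give no numbers.
train = [((1, 1), "RC"), ((2, 2), "RC"), ((2, 1), "RC"),
         ((6, 6), "GS"), ((7, 5), "GS"), ((6, 7), "GS")]

def knn_classify(train, query, k):
    # "Draw a circle enclosing k points": take the k nearest neighbors,
    # then assign the majority class among them.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

blue_star = (2.0, 1.5)   # BS sits among the red circles
prediction = knn_classify(train, blue_star, k=3)
```

With K = 3 all three nearest neighbors are RC, so `prediction` is `"RC"`, matching the slide's unanimous vote.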
HOW DO WE CHOOSE THE K VALUE?
According to the example given before, the six training observations
remain constant while the decision boundaries change with different
values of K. These decision boundaries segregate RC from GS.

If we watch carefully, we can see that the boundary becomes smoother
with increasing values of K. As K increases to infinity, the boundary
finally becomes all blue or all red, depending on the overall majority.

This makes the story clearer. At K = 1, we were overfitting the
boundaries. Hence, the error rate initially decreases and reaches a
minimum.

P/S: a larger K generally works well, but too large a K may include
majority points from other classes.
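One common way to pick K in practice is to sweep K and measure the error rate, e.g. with leave-one-out validation. A small sketch on made-up data (the slides describe the curve but give no numbers); note how a too-large K lets the other class's points outvote the true class:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def loo_error(data, k):
    # Leave-one-out: classify each point using all the other points.
    wrong = sum(knn_classify(data[:i] + data[i + 1:], x, k) != y
                for i, (x, y) in enumerate(data))
    return wrong / len(data)

data = [((1, 1), "RC"), ((2, 2), "RC"), ((2, 1), "RC"),
        ((6, 6), "GS"), ((7, 5), "GS"), ((6, 7), "GS")]

# Pick the K with the lowest leave-one-out error.
best_k = min(range(1, 6), key=lambda k: loo_error(data, k))
```

Here K = 5 forces every vote to include three points from the opposite class, so every prediction flips: exactly the "too large K" failure the slide warns about.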
KNN IN LIFE SCENARIOS

01 - IMAGE RECOGNITION
Classifying objects in images such as cars, cats, dogs, etc.

02 - SCIENCE & ENGINEERING
Predicting the 3D structure of proteins based on their amino acid
sequence.

03 - SPAM FILTERING
Identifying spam emails based on their text content and sender
information.
WHAT IS RANDOM FOREST?
Random Forest (or Random Decision Forest) is a method that operates by
constructing multiple decision trees during the training phase.
RANDOM FOREST CONCEPT
Imagine a forest with many trees. Each tree represents a decision
tree, a simple model that makes predictions based on a series of
YES/NO questions about the data features.

The Random Forest classification goes from high entropy to lower
entropy.

The decision of the majority of the trees is chosen by the random
forest as the final decision.
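"High entropy to lower entropy" refers to how each tree's YES/NO splits are chosen: a split is good if it leaves the label sets purer (lower Shannon entropy) than before. A minimal sketch of that measure:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels.
    # A 50/50 mix has entropy 1 bit; a pure node has entropy 0.
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

Each YES/NO question in a tree is picked to drive this value down, so the leaves end up (nearly) pure.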
HOW RANDOM FOREST WORKS
Using the illustration of the random forest concept, let's break down
how it works.

Let's say we want to classify an unknown fruit. Features from the
actual data:

Color  | Diameter
red    | 1
yellow | 3

To apply random forest, Bootstrap Aggregation (bagging) is used in the
algorithm. Here, a bootstrap sample is taken from the actual data for
each tree (tree 1, tree 2, tree 3) with replacement, which means there
is a high possibility that each sample won't contain only unique data
points.

(Figure: bootstrapping the training data)
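Bootstrapping can be sketched in a few lines. The fruit rows below extend the slide's two examples with two made-up rows so the sampling is visible:

```python
import random

def bootstrap_sample(data, rng):
    # Draw len(data) rows WITH replacement: duplicates are likely, and
    # on average about a third of the rows are left out of each sample.
    return [rng.choice(data) for _ in data]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
data = [("red", 1), ("yellow", 3), ("red", 2), ("yellow", 4)]  # (color, diameter)
samples = [bootstrap_sample(data, rng) for _ in range(3)]      # one per tree
```

Each of the three trees is then trained on its own sample, which is what makes the trees differ from one another.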
HOW RANDOM FOREST WORKS
Trees 1 and 3 classify the fruit as an orange, while tree 2 classifies
it as a cherry.

Here, the majority of the trees in the random forest vote for orange.
Thus, the fruit we wanted to classify earlier is an orange.
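The final aggregation step above is just a majority vote over the trees' individual predictions, as a minimal sketch:

```python
from collections import Counter

def forest_predict(tree_votes):
    # The forest's final decision is the class predicted by the
    # majority of its trees.
    return Counter(tree_votes).most_common(1)[0][0]

# Slide example: trees 1 and 3 say orange, tree 2 says cherry.
decision = forest_predict(["orange", "cherry", "orange"])
```

`decision` comes out as `"orange"`, matching the 2-to-1 vote on the slide.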
WHY RANDOM FOREST?

NO OVERFITTING
The use of multiple trees reduces the risk of overfitting, and
training time is less.

HIGH ACCURACY
Runs efficiently on large databases; for large data, it produces
highly accurate predictions.

ESTIMATES MISSING DATA
Random forest can maintain accuracy when a large proportion of the
data is missing.
THANK YOU
