
CLASSIFICATION

K-NEAREST NEIGHBOR & RANDOM FOREST


presented by group 4
WHAT IS KNN?
KNN is a simple and powerful supervised learning algorithm that can be
used for both classification and regression tasks. It is a non-parametric
algorithm, meaning it makes no assumptions about the underlying data
distribution.
KNN CLASSIFICATION CONCEPT
K-nearest neighbors stores all available cases and classifies a new
case based on a similarity measure (a distance function).

It relies on the idea that similar data points tend to have similar
labels or values.

An object (a new instance) is classified by a majority vote of its
neighbors' classes.

The object is then assigned to the most common class among its K
nearest neighbors.
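The "similarity measure (distance function)" above is most often plain Euclidean distance. As a minimal sketch (the slides don't commit to a specific distance):

```python
import math

# Euclidean distance between two feature vectors: the most common
# "similarity measure" used by KNN (smaller distance = more similar).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Other choices (Manhattan, cosine) plug into KNN the same way; only this function changes.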
HOW KNN WORKS?
Let's take a simple case to understand this algorithm. Following is a
spread of red circles (RC) and green squares (GS).

You intend to find out the class of the blue star (BS). BS can either
be RC or GS and nothing else.

The K value in the KNN algorithm is the number of nearest neighbors we
wish to take a vote from. Let's assume K = 3.

We will now make a circle with BS as the center, just big enough to
enclose only three data points on the plane.

The three closest points to BS are all RC. Hence, with a good
confidence level, we can say that BS should belong to the class RC.

Here, the choice became obvious as all three votes from the closest
neighbors went to RC. Thus, the choice of the parameter K is very
crucial in this algorithm.
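The walkthrough above can be sketched from scratch. The slides show the RC/GS spread on a plot without coordinates, so the points below are made-up stand-ins; the vote logic itself is the algorithm as described:

```python
from collections import Counter
import math

# Hypothetical coordinates for the red circles (RC) and green squares (GS);
# the slides draw them on a plane but give no numbers.
train = [((1, 1), "RC"), ((2, 2), "RC"), ((2, 1), "RC"),
         ((6, 6), "GS"), ((7, 5), "GS"), ((6, 7), "GS")]

def knn_classify(train, query, k):
    # "Draw a circle enclosing k points": take the k nearest neighbors,
    # then assign the majority class among them.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

blue_star = (2.0, 1.5)   # BS sits among the red circles
prediction = knn_classify(train, blue_star, k=3)
```

With K = 3 all three nearest neighbors are RC, so `prediction` is `"RC"`, matching the slide's unanimous vote.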
HOW DO WE CHOOSE THE K VALUE?
According to the example given before, the six training observations
remain constant while the decision boundaries change with different
values of K. These decision boundaries segregate RC from GS.

If we watch carefully, we can see that the boundary becomes smoother
with increasing values of K. As K increases to infinity, the boundary
finally becomes all blue or all red, depending on the overall majority.

This makes the story clearer. At K = 1, we were overfitting the
boundaries. Hence, the error rate initially decreases and reaches a
minimum.

P/S: a larger K generally works well, but too large a K may include
majority points from other classes.
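One common way to pick K in practice is to sweep K and measure the error rate, e.g. with leave-one-out validation. A small sketch on made-up data (the slides describe the curve but give no numbers); note how a too-large K lets the other class's points outvote the true class:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

def loo_error(data, k):
    # Leave-one-out: classify each point using all the other points.
    wrong = sum(knn_classify(data[:i] + data[i + 1:], x, k) != y
                for i, (x, y) in enumerate(data))
    return wrong / len(data)

data = [((1, 1), "RC"), ((2, 2), "RC"), ((2, 1), "RC"),
        ((6, 6), "GS"), ((7, 5), "GS"), ((6, 7), "GS")]

# Pick the K with the lowest leave-one-out error.
best_k = min(range(1, 6), key=lambda k: loo_error(data, k))
```

Here K = 5 forces every vote to include three points from the opposite class, so every prediction flips: exactly the "too large K" failure the slide warns about.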
KNN IN LIFE SCENARIOS

01 - IMAGE RECOGNITION
Classifying objects in images such as cars, cats, dogs, etc.

02 - SCIENCE & ENGINEERING
Predicting the 3D structure of proteins based on their amino acid
sequence.

03 - SPAM FILTERING
Identifying spam emails based on their text content and sender
information.
WHAT IS RANDOM FOREST?
Random Forest (or Random Decision Forest) is a method that operates by
constructing multiple decision trees during the training phase.
RANDOM FOREST CONCEPT
Imagine a forest with many trees. Each tree represents a decision
tree, a simple model that makes predictions based on a series of
YES/NO questions about the data features.

The Random Forest classification goes from high entropy to lower
entropy.

The decision of the majority of the trees is chosen by the random
forest as the final decision.
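"High entropy to lower entropy" refers to how each tree's YES/NO splits are chosen: a split is good if it leaves the label sets purer (lower Shannon entropy) than before. A minimal sketch of that measure:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels.
    # A 50/50 mix has entropy 1 bit; a pure node has entropy 0.
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

Each YES/NO question in a tree is picked to drive this value down, so the leaves end up (nearly) pure.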
HOW RANDOM FOREST WORKS
Using the illustration of the random forest concept, let's break down
how it works.

Let's say we want to classify an unknown fruit. Features from the
actual data:

Color  | Diameter
red    | 1
yellow | 3

To apply random forest, Bootstrap Aggregation (bagging) is used in the
algorithm. Here, a bootstrap sample is taken from the actual data for
each tree (tree 1, tree 2, tree 3) with replacement, which means there
is a high possibility that each sample won't contain only unique data
points.

(Figure: bootstrapping the training data)
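Bootstrapping can be sketched in a few lines. The fruit rows below extend the slide's two examples with two made-up rows so the sampling is visible:

```python
import random

def bootstrap_sample(data, rng):
    # Draw len(data) rows WITH replacement: duplicates are likely, and
    # on average about a third of the rows are left out of each sample.
    return [rng.choice(data) for _ in data]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
data = [("red", 1), ("yellow", 3), ("red", 2), ("yellow", 4)]  # (color, diameter)
samples = [bootstrap_sample(data, rng) for _ in range(3)]      # one per tree
```

Each of the three trees is then trained on its own sample, which is what makes the trees differ from one another.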
HOW RANDOM FOREST WORKS
Trees 1 and 3 classify the fruit as an orange, while tree 2 classifies
it as a cherry.

Here, the majority of the trees in the random forest vote for orange.
Thus, the fruit we wanted to classify earlier is an orange.
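The final aggregation step above is just a majority vote over the trees' individual predictions, as a minimal sketch:

```python
from collections import Counter

def forest_predict(tree_votes):
    # The forest's final decision is the class predicted by the
    # majority of its trees.
    return Counter(tree_votes).most_common(1)[0][0]

# Slide example: trees 1 and 3 say orange, tree 2 says cherry.
decision = forest_predict(["orange", "cherry", "orange"])
```

`decision` comes out as `"orange"`, matching the 2-to-1 vote on the slide.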
WHY RANDOM FOREST?

NO OVERFITTING
The use of multiple trees reduces the risk of overfitting, and
training time is less.

HIGH ACCURACY
Runs efficiently on large databases; for large data, it produces
highly accurate predictions.

ESTIMATES MISSING DATA
Random forest can maintain accuracy when a large proportion of the
data is missing.
THANK YOU
