
K-Nearest Neighbor (KNN) Algorithm

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data set that are similar to those of the cat and dog images, and based on the most similar features it will put the image in either the cat or the dog category.

Why do we need the K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
[Diagram: two panels, "Before KNN" and "After KNN", each showing the Category A and Category B clusters and a new data point; in the "After KNN" panel the new data point has been assigned to a category by K-NN.]

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the existing data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider
the below image:
[Diagram: a new data point plotted between the Category A and Category B clusters.]

o Firstly, we will choose the number of neighbors, so we will choose K=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
Euclidean Distance between A(X1, Y1) and B(X2, Y2) = √((X2 - X1)² + (Y2 - Y1)²)
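
For illustration, the same formula can be written in a few lines of Python (a minimal sketch; the sample coordinates are invented):

    import math

    def euclidean_distance(a, b):
        # Straight-line distance between two points of the same dimension
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Invented 2-D points A(X1, Y1) = (1, 2) and B(X2, Y2) = (4, 6)
    print(euclidean_distance((1, 2), (4, 6)))  # 5.0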
o By calculating the Euclidean distance, we got the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:

CategoryA neghboS
category B2neighbors
Category B

New New Data


point

Category A
o As we can see, the 3 nearest neighbors are from Category A; hence this new data point must belong to Category A.
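
The whole procedure above can be sketched in plain Python. This is a minimal from-scratch illustration, not a production implementation; the toy points and labels are invented, and in practice a library class such as scikit-learn's KNeighborsClassifier would usually be used:

    import math
    from collections import Counter

    def knn_classify(training_data, new_point, k=5):
        # training_data: list of ((x, y), label) pairs
        # Steps 2-3: sort by Euclidean distance and keep the k closest points
        neighbors = sorted(
            training_data,
            key=lambda item: math.dist(item[0], new_point)
        )[:k]
        # Steps 4-5: count the labels among the neighbors, take the majority
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Toy 2-D dataset labeled Category A or Category B (made up)
    data = [((1, 2), "A"), ((2, 1), "A"), ((2, 3), "A"),
            ((6, 5), "B"), ((7, 7), "B"), ((6, 6), "B")]
    print(knn_classify(data, (3, 3), k=5))  # 3 of the 5 neighbors -> "A"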

How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.

o Large values for K are good, but they may lead to some difficulties.
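
In practice, a common way to choose K is to try several candidate values and keep the one that scores best under cross-validation. A sketch using scikit-learn (assuming scikit-learn is installed; the synthetic dataset is only for illustration):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic two-class dataset, just for illustration
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)

    # Score each candidate K with 5-fold cross-validation, keep the best
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        for k in (1, 3, 5, 7, 9, 11)
    }
    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])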

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o It always needs to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the data points for all the training samples.
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
[Diagram: a Root Node at the top splits into Decision Nodes; Decision Nodes split further (forming sub-trees) and end in Leaf Nodes.]

Why use Decision Trees?


There are various algorithms in Machine Learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies


o Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
o Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after getting a leaf node.
o Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
o Branch/Sub-Tree: A tree formed by splitting the tree.
o Pruning: Pruning is the process of removing unwanted branches from the tree.
o Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.
How does the Decision Tree algorithm work?

In a decision tree, to predict the class of the given dataset, the algorithm starts from the root node of the tree. The algorithm compares the values of the root attribute with the record's (real dataset's) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot classify the nodes further; the final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node splits further into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
[Diagram: Salary is between $50000-$80000? No → Declined offer. Yes → Office near to home? No → Declined offer. Yes → Provides cab facility? No → Declined offer. Yes → Accepted offer.]
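
A tree like this is usually learned from data rather than drawn by hand. Below is a minimal sketch using scikit-learn's DecisionTreeClassifier, which implements CART; the tiny job-offer dataset is invented to mirror the example above:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented data: [salary_in_range, office_near_home, cab_facility]
    X = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 0, 1]]
    y = ["Accept", "Decline", "Decline", "Decline", "Decline", "Decline"]

    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["salary_ok", "near_home", "cab"]))
    print(tree.predict([[1, 1, 1]]))  # -> ['Accept']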

Attribute Selection Measures


While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. By this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
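
To make the formulas concrete, here is a small sketch that computes entropy and the information gain of a split (the class distributions below are made up):

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum(p * log2(p)) over each class probability p
        total = len(labels)
        return -sum(
            (count / total) * math.log2(count / total)
            for count in Counter(labels).values()
        )

    def information_gain(parent, subsets):
        # Entropy of the parent minus the weighted average entropy of subsets
        weighted = sum(len(s) / len(parent) * entropy(s) for s in subsets)
        return entropy(parent) - weighted

    labels = ["yes"] * 9 + ["no"] * 5           # made-up class distribution
    split = [["yes"] * 6 + ["no"] * 2,          # subset where attribute = true
             ["yes"] * 3 + ["no"] * 3]          # subset where attribute = false
    print(round(entropy(labels), 3))            # ≈ 0.940
    print(round(information_gain(labels, split), 3))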

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)², where Pj is the probability of class j
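
A matching sketch for the Gini index, using the same invented class distribution as above:

    from collections import Counter

    def gini_index(labels):
        # Gini Index = 1 - sum of squared class probabilities
        total = len(labels)
        return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

    print(gini_index(["yes"] * 9 + ["no"] * 5))  # ≈ 0.459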

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.

A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. Therefore, a technique that decreases the size of the learning tree without reducing accuracy is known as Pruning. There are mainly two types of tree pruning technology used:

o Cost Complexity Pruning
o Reduced Error Pruning
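
As an illustration of cost complexity pruning, scikit-learn's DecisionTreeClassifier exposes a ccp_alpha parameter; a sketch on a synthetic dataset (the alpha value is an arbitrary choice for demonstration):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, random_state=0)  # synthetic data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_tr, y_tr)

    # The pruned tree is smaller and often generalizes at least as well
    print(unpruned.get_n_leaves(), unpruned.score(X_te, y_te))
    print(pruned.get_n_leaves(), pruned.score(X_te, y_te))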
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
