
K-Nearest Neighbour (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
o During the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.

How does K-NN work?


The working of K-NN can be explained on the basis of the below algorithm (a code sketch follows the steps):

o Step-1: Select the number K of neighbours.
o Step-2: Calculate the Euclidean distance from the new data point to the available data points.
o Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
o Step-4: Among these K neighbours, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
o Step-6: Our model is ready.
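To make these steps concrete, below is a minimal Python sketch of the procedure. It assumes NumPy is available; the training points, labels, and variable names are invented for illustration.

import numpy as np

# Hypothetical training data: two features per point, with class labels
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                    [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([5.0, 5.0])   # new data point to classify
K = 3                          # Step-1: select the number K of neighbours

# Step-2: Euclidean distance from the new point to every stored point
distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

# Step-3: take the K nearest neighbours
nearest = np.argsort(distances)[:K]

# Step-4 and Step-5: count the categories among the neighbours
# and assign the new point to the majority category
labels, counts = np.unique(y_train[nearest], return_counts=True)
print(labels[np.argmax(counts)])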

Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method
that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric
variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent linear regression as:

y = a0 + a1x + ε

Here,

y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The values of the x and y variables form the training dataset used for the linear regression model representation.
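As an illustration, below is a minimal sketch of fitting this equation with NumPy's least-squares polyfit routine; the x and y values are invented sample data.

import numpy as np

# Hypothetical training data: x = years of experience, y = salary (in $1000s)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 42.0, 48.0, 55.0])

# Fit y = a0 + a1*x by least squares; polyfit returns [a1, a0]
a1, a0 = np.polyfit(x, y, deg=1)
print(f"intercept a0 = {a0:.2f}, coefficient a1 = {a1:.2f}")

# Predict the target for a new x value using the fitted line
print(a0 + a1 * 6.0)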

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression (a code sketch follows this list).
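As a sketch of the multiple case, the example below uses scikit-learn's LinearRegression, which fits one coefficient per independent variable. The library choice and the two-feature data are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: each row holds two independent variables
X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 7.0], [4.0, 8.0]])
y = np.array([10.0, 14.0, 19.0, 23.0])

# Fit y = a0 + a1*x1 + a2*x2 by least squares
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # a0 and [a1, a2]
print(model.predict([[5.0, 9.0]]))     # prediction for a new observation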

Assumptions of Linear Regression


1. Linear relationship between the features and the target
2. Little or no multicollinearity between the features
3. Normal distribution of the error terms
4. No autocorrelation in the error terms

Model Performance:

Goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the method below:

1. R-squared method:

o R-squared is a statistical measure that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%.
o A high value of R-squared indicates a smaller difference between the predicted and actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
o It can be calculated from the below formula:

R² = 1 - (Sum of squared residuals / Total sum of squares)
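A minimal sketch of this calculation follows; the actual and predicted values are invented for illustration.

import numpy as np

# Hypothetical actual targets and model predictions
y_actual = np.array([30.0, 35.0, 42.0, 48.0, 55.0])
y_pred = np.array([29.6, 35.8, 42.0, 48.2, 54.4])

# R-squared = 1 - (sum of squared residuals / total sum of squares)
ss_res = ((y_actual - y_pred) ** 2).sum()
ss_tot = ((y_actual - y_actual.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)   # closer to 1 means a better fit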

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both Classification and Regression problems, but it is mostly preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification And Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

Building the tree can be explained with the below steps (a code sketch follows):

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes, and call the final nodes leaf nodes.
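As an illustration of the above, the sketch below uses scikit-learn's DecisionTreeClassifier, which builds a CART-style tree; the toy dataset and feature meanings are invented.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset: [age, income] -> buys the product (0 = no, 1 = yes)
X = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

# CART picks the best attribute/split at each node (Gini impurity by default)
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)
print(tree.predict([[40, 70]]))   # classify a new example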

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting.

Assumptions

o There should be some actual values in the feature variables of the dataset so that the classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

The working process can be explained in the below steps (a code sketch follows):

o Step-1: Select K random data points from the training set.
o Step-2: Build the decision trees associated with the selected data points (subsets).
o Step-3: Choose the number N of decision trees that you want to build.
o Step-4: Repeat Steps 1 & 2.
o Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority vote.
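A minimal sketch of this process using scikit-learn's RandomForestClassifier is shown below; the dataset and parameter values are invented for illustration.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset: [age, income] -> buys the product (0 = no, 1 = yes)
X = [[25, 30], [35, 60], [45, 80], [20, 20], [50, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

# n_estimators is the number N of trees; each tree is trained on a
# random bootstrap subset of the data (Steps 1-4)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Step-5: every tree votes, and the majority vote is the final prediction
print(forest.predict([[40, 70]]))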

