
NOTES: KNN (K-Nearest Neighbors)

Monday, April 12, 2021 6:41 PM

1. Features:
a. KNN is a supervised learning algorithm. It is the simplest learning algorithm.
b. Also called a voting classifier.
c. It uses either Euclidean distance or Manhattan distance to find the distance between points.
d. Distance-based & non-probabilistic algorithm.
e. It is one of the slowest algorithms, because at prediction time it has to calculate the distance between the testing point and every training point, every time.
f. KNN can be used for both categorical and continuous target variables.
g. A majority vote among the K nearest neighbors is used to assign a class to each testing value.
h. It is a non-parametric algorithm, which means it does not make any assumptions about the underlying data.
2. Formulas:
a. Formula for Euclidean distance (between points x and y with n features):
d(x, y) = √[ (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² ]
b. Formula for Manhattan distance:
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
3. Difference:
a. While the Euclidean distance is about finding the hypotenuse using the Pythagoras formula, the Manhattan distance is about just adding the two adjacent sides of the right-angled triangle.


b. The Minkowski hyper-parameter (p) for Euclidean distance is 2 & that for Manhattan distance is 1.


c. Manhattan distance will always be greater than or equal to the Euclidean distance.
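As a quick check of point (c), a minimal sketch with made-up points, computing both metrics directly with numpy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # hypotenuse-style distance, p = 2
manhattan = np.sum(np.abs(x - y))           # sum of absolute differences, p = 1

print(euclidean)   # 5.0
print(manhattan)   # 7.0 -> never smaller than the Euclidean distance
```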

4. Scaling:
1. For numeric features, data scaling is essential because KNN is a distance-based algorithm.
2. Methods of Scaling:
i. Standardization (Standard Scaler)
ii. Normalization (Min-Max Scaler)
iii. Robust Scaler

3. Scaling is used:
i. Because every variable may have a different unit & KNN does not take units into consideration. This may
result in a biased model.
ii. There may be a huge difference in the values of the x coordinates & y coordinates.
iii. Where the distances are huge, the time taken to calculate them obviously increases tremendously.
iv. The impact of outliers is reduced by using scaling.

v. If the values in any column are big, then that column will carry more weightage. In such a case, even
if the column is not important, it may still be given undue importance by the model, which is unwanted (see the sketch after this list).
vi. Scaling helps in balancing the weightage of the features.
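A small illustration of point (v), with hypothetical values: the column measured in large units (salary) dominates the raw Euclidean distance, even though the other column (age) may matter just as much.

```python
import numpy as np

# Two people described by age (years) and salary (rupees); made-up values.
a = np.array([25, 50000])
b = np.array([45, 52000])

# Euclidean distance on the raw, unscaled values.
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # ~2000.1: almost entirely driven by the salary column,
             # even though the 20-year age gap may be just as important.
```

After scaling both columns to a comparable range (e.g., 0 to 1), age contributes to the distance on equal footing with salary.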

4. Types of feature scaling:


i. Min-Max Scaler (Normalization)
ii. Standard Scaler (Standardization)
iii. Robust Scaler
iv. Max-Abs Scaler
v. Unit Vector Scaler
vi. Power Transform
vii. Log Transform
viii. Box-Cox Transform

5. Normalization scaling:
i. A commonly used scaling technique wherein the values are shifted & re-scaled such that they end up
ranging between 0 and 1. It is also known as Min-Max scaling.

6. Formula for Normalization:
x' = (x − x_min) / (x_max − x_min)
7. Standardization Scaling:
Standardization (also called Z-score normalization) is a scaling technique such that, when it is applied,
the features are rescaled so that they have the properties of a standard normal distribution with mean
μ = 0 and standard deviation σ = 1. (Unlike normalization, the values are not bounded to a fixed range.)

8. Formula for Standardization:
z = (x − μ) / σ
5. The most important factor is to find the best value for K.


6. If the count of neighbors is the same for two classes for a particular testing point, then whichever class has
more values overall will predominate over the other, and the point will be classified into that class.
7. We can get the mean & standard deviation values from df.describe() (df.info() only shows column types & non-null counts).
8. The cv parameter is used for cross validation. A higher cv (more folds) gives a more reliable accuracy estimate, at the cost of more computation.
9. GridSearchCV is a library function that is a member of sklearn's model_selection package. It helps to loop through
predefined hyperparameters and fit your estimator (model) on your training set. So, in the end, you can select
the best parameters from the listed hyperparameters (a sketch follows below).
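A hedged sketch of how GridSearchCV might be used to tune K for KNN; the dataset (iris) and the parameter grid here are illustrative choices, not part of the original notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Loop over predefined hyperparameters with 5-fold cross validation.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11], "p": [1, 2]}  # p=1 Manhattan, p=2 Euclidean
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # best K and distance metric found by the search
print(grid.best_score_)    # mean cross-validated accuracy for those parameters
```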
10. Standardization is another scaling technique where the values are centered around zero as the mean, with a unit
standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a
unit standard deviation.
11. For evaluating any technique, we look for 3 aspects:
1. Easy to interpret output.
2. Calculation time.
3. Predictive power.

12. KNN satisfies the above 3 criteria to a good extent.

13. Pros & Cons:


a. Advantages: As the amount of data increases, the accuracy increases.
1) No assumptions required.
2) The algorithm is simple and easy to implement.
3) There’s no need to build a model, tune several parameters, or make additional assumptions.
4) The algorithm is versatile. It can be used for classification, regression, and search (as we will
see in the next section).
5) Easy to understand, implement, and explain.
6) Is a non-parametric algorithm, so does not have strict assumptions.
7) No training steps are required. It uses the training data at run time to make
predictions, making it faster to set up than all those algorithms that need to be trained.
8) Since it doesn’t need training on the train data, data points can be easily added.
9) It does not matter to KNN if you add new data at any point in time,
because KNN computes the distances every time, whereas for a Decision Tree,
even if one observation is added, the whole tree can change.
10) KNN is used for imputation & in SMOTE (oversampling) as well.

b. Disadvantages:
1) As the amount of data increases, the time consumption also increases, so time complexity becomes a disadvantage in KNN.
2) Also, it does not do well on an imbalanced data set.
3) Unlike other algorithms, it stores all the training data, so it uses more memory.
4) For small k (e.g., k < 5), predictions can be too noisy.
5) The algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increase.
6) Inefficient and slow when the dataset is large, because the cost of calculating
the distance between the new point and every training point is high.
7) Doesn’t work well with high dimensional data because it becomes harder to find
the distance in higher dimensions.
8) Sensitive to outliers, as the distances are easily distorted by them.
9) Cannot work when data is missing, so missing values need to be imputed first to
make it work.
10) Needs feature scaling/normalization

Assumptions:
14. The data is in a feature space, which means the distances between data points can be measured by
metrics such as Manhattan, Euclidean, etc.
15. Each of the training data points consists of a set of vectors and a class label associated with
each vector.
16. It is desirable to have K as an odd number in the case of 2-class classification (to avoid ties).
17. The algorithm assumes that every feature has the same magnitude & direction; however, in reality every
dimension may have labels associated with it & may vary in magnitude & direction.

18. Types of problems it can solve (Supervised):


1. Regression & Classification

19. Applications:
a. Facial recognition - used along with CNN in computer vision.

b. Used for imputation & in SMOTE.


• If you’re searching for semantically similar documents (i.e., documents containing similar topics), this is
referred to as Concept Search.
• The biggest use case of k-NN search might be Recommender Systems. If you know a user likes a particular
item, then you can recommend similar items for them. To find similar items, you compare the set of users
who like each item—if a similar set of users like two different items, then the items themselves are
probably similar!
• This applies to recommending products, recommending media to consume, or even ‘recommending’
advertisements to display to a user!
20. Whether scaling is required?
a. YES!
21. Impact of missing values:
a. KNN is indeed used for imputing missing values (a sketch follows).
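A minimal sketch of KNN-based imputation with sklearn's KNNImputer on a small made-up matrix; n_neighbors=2 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [3.0, np.nan],   # missing value to be filled in
    [5.0, 6.0],
    [7.0, 8.0],
])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
# The NaN is replaced by the mean of that column over the 2 nearest rows.
```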
22. Impact of outliers?
The algorithm is sensitive to outliers, since a single mislabeled example can dramatically change the class
boundaries. Anomalies affect the method significantly, because k-NN takes all of its information directly from the
input data rather than from a model that tries to generalize the data.
The classification accuracy of the KNN algorithm is found to be adversely affected by the presence of outliers in
experimental datasets.
23. Is it an overfitting or underfitting model?
It all depends upon choosing the right value for K: if the value of K is too low, it leads to an overfit
model, whereas a value that is too high may lead to an underfit model.
The value of K in the KNN algorithm is related to the error rate of the model. A small value of K could lead
to overfitting, just as a big value of K can lead to underfitting (see the sketch below).
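A sketch of scanning a few K values with cross-validation (on the iris data, purely for illustration) to see where accuracy sits between the small-K overfitting and large-K underfitting extremes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a range of K values with 5-fold cross validation.
for k in [1, 3, 5, 11, 25, 51]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))
# Very small K tends to overfit (noisy boundaries); very large K tends to underfit (over-smoothed).
```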
24. Library to import:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
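Continuing the snippet above, a hedged end-to-end sketch; the iris dataset, the 80/20 split, and the scaling step are illustrative assumptions rather than part of the original notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale first, since KNN is distance based.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))   # test-set accuracy
```

For a regression target, KNeighborsRegressor from the same sklearn.neighbors module is used in the same way.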
