1. Features:
a. KNN is a supervised learning algorithm and one of the simplest learning algorithms.
b. It is also called a voting classifier.
c. It uses either Euclidean distance or Manhattan distance to find the distance between points.
d. It is a distance-based, non-probabilistic algorithm.
e. It is a slow algorithm at prediction time: as a lazy learner it does almost no work during training, but it must always compute the distance between every training point and each testing point.
f. KNN can be used for both categorical and continuous target variables (classification and regression).
g. Majority voting among the K nearest neighbours is used to assign a class to each testing point.
h. It is a non-parametric algorithm, which means it makes no assumptions about the underlying data.
2. Formulas:
a. Formula for the Euclidean distance between points p and q in n dimensions:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
3. Difference:
a. While Euclidean distance finds the hypotenuse using the Pythagorean formula, Manhattan distance simply adds the two perpendicular sides of the same right triangle:
d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|
A small sketch comparing the two is shown below.
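A minimal sketch of both distance formulas in Python (the example points p and q are made-up values):

import numpy as np

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    # sum of absolute coordinate differences
    return np.sum(np.abs(p - q))

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
print(euclidean(p, q))  # 5.0 (the hypotenuse of a 3-4-5 triangle)
print(manhattan(p, q))  # 7.0 (the two sides: 3 + 4)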
4. Scaling:
1. For numeric values, data scaling is essential because KNN is a distance-based algorithm.
2. Methods of Scaling:
i. Standardization (StandardScaler)
ii. Normalization (MinMaxScaler)
iii. Robust scaling (RobustScaler)
3. Scaling is used:
i. because every variable may have a different unit, and KNN does not take units into consideration; this may result in a biased model.
ii. There may be a huge difference between the values of the x coordinates and the y coordinates.
iii. When the feature values, and hence the distances, are huge, the time needed to calculate the distances increases tremendously.
iv. The impact of outliers is reduced by scaling.
v. If the values in a column are large, that column will carry more weight, so even an unimportant column may be given undue importance by the model, which is unwanted.
vi. Scaling puts the features on a comparable footing, so that no single feature dominates the distance.
5. Normalization scaling:
i. A commonly used scaling technique wherein the values are shifted and rescaled so that they end up ranging between 0 and 1: x' = (x - min) / (max - min). It is also known as MinMax scaling.
7. Standardization Scaling:
Standardization (also called Z-score normalization) is a scaling technique that rescales each feature so that it has the properties of a standard normal distribution with mean μ = 0 and standard deviation σ = 1: z = (x - μ) / σ. Unlike normalization, the resulting values are not bounded to a fixed range. A sketch comparing the scalers follows below.
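A minimal sketch comparing the three scalers on one toy feature column (the data values, including the 100.0 outlier, are made-up):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 acts as an outlier

print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
print(RobustScaler().fit_transform(X).ravel())    # uses median & IQR, so least affected by the outlier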
3. Predictive power.
12. KNN satisfies the above 3 parameters to a good extent.
b. Disadvantages:
1) As the data grows, the time consumption also increases, so time complexity becomes a disadvantage of KNN (see the timing sketch after this list).
2) It also does not do well on imbalanced datasets.
3) Unlike other algorithms, it stores all the training data, so it uses more memory.
4) For k < 5, predictions can be too noisy.
5) The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
6) Inefficient and slow when the dataset is large, as the cost of calculating the distance between the new point and the training points is high.
7) Doesn't work well with high-dimensional data, because it becomes harder to measure distances meaningfully in higher dimensions.
8) Sensitive to outliers, being easily affected by them.
9) Cannot work when data is missing, so missing values need to be imputed manually to make it work.
10) Needs feature scaling/normalization.
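A minimal timing sketch of the lazy-learner trade-off on synthetic data (the dataset sizes are arbitrary choices): fit is nearly instant because KNN just stores the data, while predict pays the full distance-computation cost.

import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X, y)          # just stores the training data
t1 = time.perf_counter()
knn.predict(X[:1000])  # computes distances to all 20000 training points
t2 = time.perf_counter()

print(f"fit:     {t1 - t0:.3f} s")
print(f"predict: {t2 - t1:.3f} s")  # typically much larger than the fit time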
Assumptions:
14. The data is in a feature space, which means that distances between data points can be measured by distance metrics such as Manhattan, Euclidean, etc.
15. Each of the training data points consists of a feature vector and a class label associated with that vector.
16. It is desirable to have 'K' as an odd number in the case of 2-class classification, so that majority voting cannot end in a tie.
17. The algorithm assumes that every feature has the same magnitude and direction; however, in reality dimensions may carry different meanings and may vary in magnitude and direction.
19. Applications:
a. Facial recognition - used along with CNNs in computer vision.
b. Recommender systems - items can be compared through the sets of users who like each item: if a similar set of users like two different items, then the items themselves are probably similar!
• This applies to recommending products, recommending media to consume, or even ‘recommending’
advertisements to display to a user!
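A minimal item-similarity sketch using sklearn's NearestNeighbors (the tiny user-item matrix is made-up toy data):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# rows = items, columns = users; 1 means that user liked the item (toy data)
item_user = np.array([
    [1, 1, 0, 1],  # item A
    [1, 1, 0, 0],  # item B, liked by a similar set of users as A
    [0, 0, 1, 1],  # item C
])

nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(item_user)
dist, idx = nn.kneighbors(item_user[0:1])
print(idx)  # item A's nearest neighbours: itself first, then item B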
20. Is scaling required?
a. YES!
21. Impact of missing values:
a. KNN is indeed used for imputing missing values (a sketch follows below).
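A minimal sketch of KNN-based imputation with sklearn's KNNImputer (the matrix values are made-up):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))  # the nan is replaced by the mean of the 2 nearest rows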
22. Impact of outliers?
The algorithm is sensitive to outliers, since a single mislabeled example can dramatically change the class boundaries. Anomalies affect the method significantly, because k-NN gets all of its information from the input itself rather than from a model that tries to generalize the data. The classification accuracy of the KNN algorithm has been found to be adversely affected by the presence of outliers in experimental datasets.
23. Is it an overfitting or underfitting model?
It all depends on choosing the right value for k, which is directly related to the model's error rate: a value of k that is too low leads to an overfit model, whereas a value that is too high leads to an underfit model (see the sketch below).
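A minimal sketch of picking k by cross-validation (the dataset and the candidate k values are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 11, 51):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}  accuracy={score:.3f}")  # very small k tends to overfit, very large k to underfit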
24. Library to import:
# import the classifier from scikit-learn
from sklearn.neighbors import KNeighborsClassifier
# create a KNN classifier that votes among the 5 nearest neighbours
classifier = KNeighborsClassifier(n_neighbors=5)
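A minimal end-to-end sketch tying the pieces together, with scaling applied before KNN (the dataset and split parameters are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scale first, since KNN is distance-based, then classify
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out data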