
Introduction to KNN

KNN stands for K-Nearest Neighbors. KNN is a machine learning algorithm used for classifying
data. Rather than coming up with a numerical prediction such as a student's grade or a stock
price, it attempts to classify data into certain categories based on certain features. In the
next few tutorials we will be using this algorithm to classify cars into 4 categories based upon
their features.

For the official SkLearn KNN documentation, see:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

How Does K-Nearest Neighbors Work?


In short, K-Nearest Neighbors works by looking at the K closest points to
the given data point (the one we want to classify) and picking the class that
occurs the most to be the predicted value. This is why this algorithm
typically works best when we can identify clusters of points in our data set
(see below).

We can clearly see that there are groups of the same class points clustered
together. Because of this correlation we can use the K-Nearest Neighbors
algorithm to make accurate classifications.
There are three classes: red, blue, and green.

Example
Let's have a look at the following example.
We want to classify the black dot as either red, blue or green. Simply
looking at the data we can see that it clearly belongs to the red class, but
how do we design an algorithm to do this?
First, we need to define a K value. This is how many points we will use to
make our decision: the K closest points to the black dot.
Next, we need to find out the class of these K points.
Finally, we determine which class appears the most out of all of our K
points and that is our prediction.
In this example our K value is 3.
The 3 closest points to our black dot are the ones that have small black dots
on them.
All three of these points are red, therefore red occurs the most, so we
classify the black dot as part of the red group.
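
As a rough sketch of this whole procedure (the points, labels, and query point below are made up purely for illustration), the K = 3 vote can be written in a few lines:

import numpy as np
from collections import Counter

# Made-up labeled points: a red cluster and a blue cluster
points = np.array([[1.0, 1.2], [0.8, 0.9], [1.1, 0.7],
                   [5.0, 5.1], [5.2, 4.8]])
labels = ["red", "red", "red", "blue", "blue"]
query = np.array([1.0, 1.0])  # the "black dot" we want to classify

k = 3
distances = np.linalg.norm(points - query, axis=1)  # distance to every point
nearest = np.argsort(distances)[:k]                 # indices of the K closest points
prediction = Counter(labels[i] for i in nearest).most_common(1)[0][0]
print(prediction)  # red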
 
If you'd like to see some more examples with K > 3 and more difficult cases please
watch the video.

Limitations and Drawbacks


Although the KNN algorithm is very good at performing simple classification
tasks, it has many limitations. One of them is its training/prediction time.
Since the algorithm finds the distance between the given data point and every
point in the training set, it is very computationally heavy. Unlike algorithms
like linear regression, which simply apply a function to a given data point, the
KNN algorithm requires the entire data set to make a prediction. This means
every time we make a prediction we must wait for the algorithm to compare
our given data to each point. In data sets that contain millions of elements
this is a HUGE drawback.
 
Another drawback of the algorithm is its memory usage. Due to the way it
works (outlined above) it requires that the entire data set be loaded into
memory to perform a prediction. It is possible to batch load our data into
memory but that is extremely time consuming.

Summary
The K-Nearest Neighbor algorithm is very good at classification on small
data sets that contain few dimensions (features). It is very simple to
implement and is a good choice for performing quick classification on small
data. However, when moving into extremely large data sets and making a
large number of predictions it is very limited. Remember that the algorithm
has to calculate the distances to the nearest points for every single prediction.
Downloading the Data
The data set we will be using is the Car Evaluation Data Set from the UCI
Machine Learning Repository. You can download the .data file from the repository.
*IMPORTANT* If you download the file from the UCI website, you must make
the following change before using it. This is because pandas reads the first
line of the file as the names (tags) of the features.
CHANGE: Add the following line to the top of your file and save.
buying,maint,door,persons,lug_boot,safety,class
Your file should now look like the following:
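If everything went well, the first few lines of your file should look something like this (the data rows here are the first rows of the UCI file):

buying,maint,door,persons,lug_boot,safety,class
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc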
Importing Modules
Before we start we need to import a few modules. Most of these should be
familiar to you. The only one we have yet to import is the following:
from sklearn import preprocessing

This will be used to normalize our data and convert non-numeric values into numeric values.
Now our imports should include the following.
import sklearn
import sklearn.model_selection  # needed later for train_test_split
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import linear_model, preprocessing

Loading Data
After placing our car.data file into our current script directory we can load
our data. To load our data we will use the pandas module as seen in
previous tutorials.
data = pd.read_csv("car.data")
print(data.head()) # To check if our data is loaded correctly

Converting Non-Numerical Data to Numerical Data


As you may have noticed much of our data is not numeric. In order to
train the K-Nearest Neighbor Classifier we must convert any string data into
some kind of a number. Luckily for us sklearn has a method that can do this
for us.
We will start by creating a label encoder object and then use that to encode
each column of our data into integers.
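
A minimal sketch of that step (the column names match the header line we added to car.data earlier):

le = preprocessing.LabelEncoder()
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))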

The method fit_transform() takes a list (each of our columns) and will
return to us an array containing our new values.
Looking at the data and buying arrays side by side, we can see that vhigh is
given the value 3 and low the value 1; the encoded values run from 0 to 3.
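
To see this mapping for yourself, you can print a few values from both versions (purely for inspection; the first rows of the file are all vhigh, which encodes to 3):

print(data["buying"].head())  # original strings: vhigh, vhigh, ...
print(buying[:5])             # encoded integers, e.g. [3 3 3 3 3]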
Now we need to recombine our data into a feature list and a label list, since
we currently have it in separate arrays. We can use the zip() function to
make things easier. Note that in this case we use a list, not a matrix.

X = list(zip(buying, maint, door, persons, lug_boot, safety)) # features

y = list(cls) # labels

Finally we will split our data into training and testing data using the same
process seen previously.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

Do not use a test_size above 0.2, so as not to hurt performance.

(To do: figure out how to do this with matrices instead of lists? See the sketch below.)
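
If you would rather work with NumPy arrays (matrices) instead of lists, one way to do the same recombination is with np.column_stack; this is just an alternative sketch, not what the rest of the tutorial uses:

X = np.column_stack((buying, maint, door, persons, lug_boot, safety))  # 2D feature array
y = np.array(cls)                                                      # 1D label array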

Training a KNN Classifier


Creating a KNN Classifier is almost identical to how we created the linear
regression model. The only difference is we can specify how many
neighbors to look for as the argument n_neighbors.
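
For example (n_neighbors=5 here is just a starting point, see the note below):

model = KNeighborsClassifier(n_neighbors=5)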

To train our model we follow precisely the same steps as outlined earlier.
And once again to score our model we will do the following.
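
A sketch of that training and scoring step:

model.fit(x_train, y_train)        # train on the training data
acc = model.score(x_test, y_test)  # accuracy on the held-out test data
print(acc)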

You can also change the number of neighbors to try to increase accuracy; 5 is a reasonable starting point.

Testing Our Model


If we'd like to see how our model is performing on the unique elements of
our test data we can do the following.

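A sketch of that step. Note that LabelEncoder assigns integers in alphabetical order of the original strings, which is why the names list below is ordered acc, good, unacc, vgood:

predicted = model.predict(x_test)

# We create a names list so that we can convert our integer predictions into
# their string representation (LabelEncoder encodes the classes alphabetically)
names = ["acc", "good", "unacc", "vgood"]

# This will display the predicted class, our data and the actual class
for x in range(len(predicted)):
    print("Predicted:", names[predicted[x]], "Data:", x_test[x], "Actual:", names[y_test[x]])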

Our output should look something like the following: one line per test element, showing the predicted class, the encoded feature values and the actual class.

Looking at Neighbors
The KNN model has a unique method that allows us to see the neighbors
of a given data point. We can use this information to plot our data and get a
better idea of where our model may lack accuracy. We can
use model.kneighbors() to do this.
Note: the .kneighbors() method takes a 2D array as input; this means if we want to
pass one data point we need to surround it with [] so that it is in the right
shape.
Parameters: the parameters for .kneighbors() are as follows: data (2D array), number
of neighbors (int), return_distance (True or False).
Return: this will return to us an array with the index in our data of each
neighbor. If return_distance=True then it will also return the distance from our
data point to each neighbor; in that case the distances come first, then the indices.
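
For example, to look at the neighbors of the first point in our test set (the choice of point and of 5 neighbors here is arbitrary):

# kneighbors expects a 2D array, so we wrap the single data point in a list;
# with return_distance=True it returns the distances first, then the indices
n = model.kneighbors([x_test[0]], 5, True)
print(n)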

Our output should now be a somewhat messy pair of arrays: the distances to each neighbor, followed by their indices in our data.
