KNN stands for K-Nearest Neighbors. KNN is a machine learning algorithm used for classifying
data. Rather than producing a numerical prediction, such as a student's grade or a stock
price, it attempts to classify data into certain categories based on its features. In the
next few tutorials we will use this algorithm to classify cars into 4 categories based upon
certain features.
We can clearly see that there are groups of points of the same class clustered
together. Because of this correlation we can use the K-Nearest Neighbors
algorithm to make accurate classifications.
There are three classes: red, blue, and green.
Example
Let's have a look at the following example.
We want to classify the black dot as either red, blue or green. Simply
looking at the data we can see that it clearly belongs to the red class, but
how do we design an algorithm to do this?
First, we need to define a K value. This is going to be how many points we
should use to make our decision. The K closest points to the black dot.
Next, we need to find out the class of these K points.
Finally, we determine which class appears the most out of all of our K
points and that is our prediction.
In this example our K value is 3.
The 3 closest points to our black dot are the ones that have small black dots
on them.
All three of these points are red, therefore red occurs the most, so we
classify the black dot as part of the red group.
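The three steps above can be sketched in plain Python. The point coordinates and colors here are made up for illustration; they just mirror the example:

```python
from collections import Counter
import math

def knn_classify(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Step 1: compute the distance from the query to every labeled point.
    distances = [math.dist(p, query) for p in points]
    # Step 2: find the indices of the k closest points.
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # Step 3: the most common class among those k neighbors is the prediction.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: three red points sit near the query, two green points far away.
points = [(1, 1), (1, 2), (5, 5), (2, 1), (6, 6)]
labels = ["red", "red", "green", "red", "green"]
print(knn_classify(points, labels, (1.5, 1.5), k=3))  # red
```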
If you'd like to see some more examples with K > 3 and more difficult cases please
watch the video.
Summary
The K-Nearest Neighbor algorithm is very good at classification on small
data sets that contain few dimensions (features). It is very simple to
implement and is a good choice for performing quick classification on small
data. However, it is very limited when moving into extremely large data sets
or making a large number of predictions, because the algorithm must compute
the distance from each new point to every point in the training data.
Downloading the Data
The data set we will be using is the Car Evaluation Data Set from the UCI
Machine Learning Repository. You can download the .data file below.
Download Data: Download Now
*IMPORTANT* If you choose to download the file from the UCI website
you must make the following change (if you clicked the download button it
has been done for you). This is because pandas reads the first line of the
file as the names of the features.
CHANGE: Add the following line to the top of your file and click save.
buying,maint,door,persons,lug_boot,safety,class
Your file should now begin with that header line, followed by the data rows.
Importing Modules
Before we start we need to import a few modules. Most of these should be
familiar to you. The only one we have yet to import is the following:
from sklearn import preprocessing
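For reference, the full import block for this script might look something like the following; the exact list depends on what you carried over from the earlier tutorials:

```python
import pandas as pd
import numpy as np
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
```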
Loading Data
After placing our car.data file into our current script directory we can load
our data. To load our data we will use the pandas module, as seen in
previous tutorials.
data = pd.read_csv("car.data")
print(data.head()) # To check if our data is loaded correctly
y = list(cls) # labels: cls holds the integer-encoded "class" column
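Before we can build `X` and `y`, the string columns need to be converted to integers. One way to do this is with sklearn's `LabelEncoder`, using the column names from the header line we added earlier. The sketch below runs on a tiny made-up stand-in for car.data; in the real script you would keep the `pd.read_csv("car.data")` call from above:

```python
import pandas as pd
from sklearn import preprocessing

# Tiny stand-in for car.data (in the real script: data = pd.read_csv("car.data")).
data = pd.DataFrame({
    "buying": ["vhigh", "low", "med"],
    "maint": ["vhigh", "low", "med"],
    "door": ["2", "4", "3"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "big", "med"],
    "safety": ["low", "high", "med"],
    "class": ["unacc", "acc", "good"],
})

# LabelEncoder maps each distinct string in a column to an integer code.
le = preprocessing.LabelEncoder()
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))

# Features are the encoded columns zipped row-wise; labels are the class codes.
X = list(zip(buying, maint, door, persons, lug_boot, safety))
y = list(cls)
```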
Finally we will split our data into training and testing data using the same
process seen previously.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
To train our model we follow precisely the same steps as outlined earlier.
And once again to score our model we will do the following.
You can also change the number of neighbors to improve accuracy; 5 is a good starting point.
# This will display the predicted class, our data and the actual class
# We create a names list so that we can convert our integer predictions into
# their string representation
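Those training, scoring, and prediction steps might look like the following sketch. The data and class names here are synthetic placeholders, but the flow is the same as with car.data, where you would use the `X`, `y`, and train/test split built earlier:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two well-separated clusters of points.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]] * 5
y = [0, 0, 0, 1, 1, 1] * 5

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Train a KNN classifier with 5 neighbors (change this value to tune accuracy).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)

# Score the model on the held-out test data.
acc = model.score(x_test, y_test)
print("accuracy:", acc)

# Display the predicted class, our data, and the actual class.
# The names list converts integer predictions into a string representation;
# these labels are placeholders for the encoded classes in car.data.
names = ["class_a", "class_b"]
predicted = model.predict(x_test)
for features, pred, actual in zip(x_test, predicted, y_test):
    print("Predicted:", names[pred], "Data:", features, "Actual:", names[actual])
```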
Looking at Neighbors
The KNN model has a unique method that allows us to see the neighbors
of a given data point. We can use this information to plot our data and get a
better idea of where our model may lack accuracy. We can
use model.kneighbors to do this.
Note: the .kneighbors method takes a 2D array as input; this means if we want to
pass one data point we need to surround it with [] so that it is in the right
shape.
Parameters: The parameters for .kneighbors are as follows: data (2D array), number
of neighbors (int), return_distance (True or False)
Return: This will return an array with the index in our data of each
neighbor. If return_distance=True then it will also return the distance to each
neighbor from our data point.
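In scikit-learn this method is `KNeighborsClassifier.kneighbors`. A short sketch on synthetic data, where the query point `[0, 0]` is wrapped in an extra pair of brackets to make it 2D:

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny synthetic data set; indices 0-2 form one cluster, 3-5 another.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# kneighbors takes a 2D array, so the single point is wrapped in [].
distances, indices = model.kneighbors([[0, 0]], n_neighbors=3, return_distance=True)
print(indices)    # indices into X of the 3 nearest neighbors
print(distances)  # distance from the query to each of those neighbors
```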