
K-Means Clustering

K-Means clustering is an unsupervised learning algorithm that attempts to
divide our training data into k unique clusters to classify information. This
means the algorithm does not require labels for the data we give it. Instead, it is
responsible for learning the differences between our data points on its own and
determining which cluster, and therefore which class, each point belongs to.

Supervised vs Unsupervised Algorithm


Up until this point we have been using supervised machine learning
algorithms. Supervised learning means that when we pass training data to
our algorithm we also pass the target values or classes for each of
those data points. For example, when we were classifying the safety of cars
we gave the algorithm the features of each car and told it whether the car was
safe or not. If we were using an unsupervised algorithm we would only pass
the features and omit the class.
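
To make the difference concrete, here is a minimal sketch using sklearn. The tiny X and y arrays are made-up placeholders, not part of this tutorial's data set:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Made-up example data: four points with two features each, two classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
y = np.array([0, 0, 1, 1])

# Supervised: the labels y are passed along with the features X.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# Unsupervised: only the features are passed; the algorithm must
# group the points without ever seeing y.
km = KMeans(n_clusters=2, n_init=10)
km.fit(X)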

How K-Means Clustering Works


The K-Means clustering algorithm follows the steps outlined below to cluster
data points together. It attempts to separate our high dimensional space into
sections, one for each class. When we use it to predict, it simply finds which
section our point falls in and assigns it to that class.
Step 1: Randomly pick K points to place K centroids
Step 2: Assign all of the data points to the centroids by distance. The closest
centroid to a point is the one it is assigned to.
Step 3: Average all of the points belonging to each centroid to find the
middle of those clusters (center of mass). Place the corresponding centroids
into that position.
Step 4: Reassign every point once again to the closest centroid.
Step 5: Repeat steps 3-4 until no point changes which centroid it belongs to.
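
The steps above translate almost directly into code. Below is a rough NumPy sketch of the loop, written only to illustrate the algorithm; it is not the implementation we use later (we rely on sklearn's KMeans for that):

import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K of the data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Steps 2 and 4: assign every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Step 5: stop once no point changes which centroid it belongs to.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean (center of mass) of its cluster.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

# Example: two obvious clusters in 2D.
pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
centers, labels = kmeans(pts, k=2)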

Visual Example
This example is done in 2D space with K=2, meaning each data point has two features and we are looking for two clusters (classes).
The data we will use is shown below.

Step 1: Two random centroids are created and placed on the graph.

Step 2: Each point is assigned to a centroid.


Step 3: Each centroid is moved into the center of each cluster of points

Step 4: The points are reassigned to the closest centroids.

Step 5: Since no point changed which centroid it belongs to, we do not need to repeat
steps 3-4 again and the algorithm has finished.

Implementing K Means Clustering


For this tutorial we will use the K Means algorithm to classify handwritten
digits. Like the last tutorial, we will simply import the digits data set
from sklearn to save us a bit of time.

Importing Modules
Before we can begin we must import the following modules.
import numpy as np
import sklearn
from sklearn.preprocessing import scale
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn import metrics

Loading the Data-set


We are going to load the data set from the sklearn module and use the scale
function to scale our data down. The features of the digits data set contain
fairly large values, so we standardize them (scale transforms each feature to
have zero mean and unit variance) to simplify calculations and make training
easier and more accurate.
digits = load_digits()
data = scale(digits.data)
y = digits.target

k = 10
samples, features = data.shape
We also define the number of clusters by creating the variable k, and we get
how many samples and features we have from the shape of the data set.
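
If you want to confirm what the scaling did, an optional check like the one below prints the shape of the data and the overall statistics after scaling; the exact numbers are simply what load_digits happens to contain:

# Optional sanity check: 1797 samples with 64 features each (8x8 images),
# and the scaled data has (roughly) zero mean.
print(samples, features)
print(data.mean(), data.std())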

Scoring
To score our model we are going to use a function taken from the sklearn
documentation. It computes several different clustering scores for our model.
If you'd like to learn more about what these values mean, please visit the
following website.
def bench_k_means(estimator, name, data):
    estimator.fit(data)
    print('%-9s\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, estimator.inertia_,
             metrics.homogeneity_score(y, estimator.labels_),
             metrics.completeness_score(y, estimator.labels_),
             metrics.v_measure_score(y, estimator.labels_),
             metrics.adjusted_rand_score(y, estimator.labels_),
             metrics.adjusted_mutual_info_score(y, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean')))

Training the Model


Finally, to train the model we create a K Means classifier and pass it to the
function we created above, which trains it and prints the scores.
clf = KMeans(n_clusters=k, init="random", n_init=10)
bench_k_means(clf, "1", data)
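
The output of bench_k_means is a single tab-separated row of numbers, so it can help to print a matching header line just before the call above. The labels below are my own shorthand for the metrics, not something defined by sklearn:

# Optional header row for the tab-separated scores printed by bench_k_means.
print('name\t\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')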

MatplotLib Visualization Example


To see a visual representation of how K Means works, you can copy and run
the K-means digits example from the SkLearn documentation on your computer.
It plots the clustered data points along with each cluster's centroid. A
condensed sketch of that example follows below.
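
This sketch is loosely based on the sklearn plot_kmeans_digits example and may differ in detail from the current version of the documentation. It reduces the digits to two dimensions with PCA so the clusters can be drawn, fits K Means on the reduced data, and colours the plane by predicted cluster:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

# Reduce the 64-dimensional digits data to 2D so it can be plotted.
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10)
kmeans.fit(reduced_data)

# Build a mesh over the plot area and predict the cluster of every mesh
# point to colour the background by cluster region.
h = 0.2  # mesh step size (smaller = finer but slower)
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired, aspect="auto", origin="lower")
plt.plot(reduced_data[:, 0], reduced_data[:, 1], "k.", markersize=2)

# Mark each centroid with a white X.
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x",
            s=169, linewidths=3, color="w", zorder=10)
plt.title("K Means clustering on the digits data set (PCA-reduced)")
plt.xticks(())
plt.yticks(())
plt.show()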
