
CHAPTER 7:

CLUSTERING
DR AZLIN AHMAD
CONTENT

• K-Means
• K-Nearest Neighbor
WHAT IS CLUSTERING?

• Clustering:
  • a process of grouping similar objects into groups called clusters
  • the clusters reveal hidden patterns in the data set
  • widely used in numerous applications: pattern recognition, data analysis, image processing, life sciences, etc.
• Among the most popular traditional clustering algorithms are K-Means, K-Nearest Neighbors, Kohonen Self-Organizing Maps (KSOM), hierarchical clustering, etc.
• Clustering analysis is used to gain valuable insights from data by seeing which groups the data points fall into when a clustering algorithm is applied.
K-MEANS CLUSTERING

• K-Means is probably the most well-known clustering algorithm.
• It is commonly used to solve various clustering problems in many areas.
• K-Means has the advantage of being pretty fast, as all we are really doing is computing the distances between points and group centers; very few computations are needed.
• K-Means has a couple of disadvantages:
  • you have to select in advance how many groups/classes there are
  • it starts with a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm
HOW?
1. Select the number of classes/groups to use and randomly initialize their respective center points.
2. Classify each data point by computing the distance between that point and each group center, and assign the point to the group whose center is closest to it.
3. Based on these classified points, recompute each group center by taking the mean of all the vectors in the group.
4. Repeat steps 2-3 for a set number of iterations or until the group centers don't change much between iterations. You can also opt to randomly initialize the group centers a few times, and then select the run that looks like it provided the best results.
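
A minimal NumPy sketch of these four steps, for illustration only (the function name kmeans, its parameters, and the convergence test are our own choices for this example, not part of any particular library):

import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """K-Means: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial group centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the group whose center is closest
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of the vectors in its group
        # (keep the old center if a group ends up empty)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop early when the centers barely change between iterations
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Usage: cluster 100 random 2-D points into 3 groups
data = np.random.default_rng(1).random((100, 2))
labels, centers = kmeans(data, k=3)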
K-NEAREST NEIGHBOR

• The KNN algorithm assumes that similar things exist in close proximity.
• In other words, similar things are near to each other.
• In a typical scatter plot of labelled data, you will notice that most of the time, similar data points are close to each other.
• KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood: calculating the distance between points on a graph.
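
The "distance between points on a graph" is typically the straight-line (Euclidean) distance, although the slides do not fix a particular metric. A minimal sketch (the helper name euclidean_distance is our own):

import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of any dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))  # -> 5.0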
HOW?

1. Load the data.
2. Initialize K to your chosen number of neighbors.
3. For each example in the data:
   a. Calculate the distance between the query example and the current example from the data.
   b. Add the distance and the index of the example to an ordered collection.
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances.
5. Pick the first K entries from the sorted collection.
6. Get the labels of the selected K entries.
7. If regression, return the mean of the K labels.
8. If classification, return the mode of the K labels.
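
A minimal sketch of these steps in plain Python, for illustration (knn_predict and its signature are our own, not a library API; Euclidean distance is assumed):

import math
from collections import Counter

def knn_predict(data, query, k, mode="classification"):
    """Minimal KNN. `data` is a list of (point, label) pairs."""
    def dist(a, b):
        # Euclidean distance between two points
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # Steps 3-4: compute the distance from the query to every example,
    # then sort in ascending order by distance
    by_distance = sorted((dist(point, query), label) for point, label in data)
    # Steps 5-6: keep the labels of the first k entries
    k_labels = [label for _, label in by_distance[:k]]
    if mode == "regression":
        return sum(k_labels) / k                   # step 7: mean of the K labels
    return Counter(k_labels).most_common(1)[0][0]  # step 8: mode of the K labels

# Usage: classify a query point against a tiny labelled data set
train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (1, 2), k=3))  # -> "A"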


ADVANTAGES & DISADVANTAGES

• Advantages
  • The algorithm is simple and easy to implement.
  • There is no need to build a model, tune several parameters, or make additional assumptions.
  • The algorithm is versatile: it can be used for classification, regression, and search.
• Disadvantages
  • The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.
